Paper deep dive
Beyond TVLA: Anderson-Darling Leakage Assessment for Neural Network Side-Channel Leakage Detection
Ján Mikulec, Jakub Breier, Xiaolu Hou
Abstract
Test Vector Leakage Assessment (TVLA) based on Welch's $t$-test has become a standard tool for detecting side-channel leakage. However, its mean-based nature can limit sensitivity when leakage manifests primarily through higher-order distributional differences. As our experiments show, this property becomes especially crucial when it comes to evaluating neural network implementations. In this work, we propose Anderson--Darling Leakage Assessment (ADLA), a leakage detection framework that applies the two-sample Anderson--Darling test for leakage detection. Unlike TVLA, ADLA tests equality of the full cumulative distribution functions and does not rely on a purely mean-shift model. We evaluate ADLA on a multilayer perceptron (MLP) trained on MNIST and implemented on a ChipWhisperer-Husky evaluation platform. We consider protected implementations employing shuffling and random jitter countermeasures. Our results show that ADLA can provide improved leakage-detection sensitivity in protected implementations for a low number of traces compared to TVLA.
Links
- Source: https://arxiv.org/abs/2603.18647v1
- Canonical: https://arxiv.org/abs/2603.18647v1
Full Text
Beyond TVLA: Anderson–Darling Leakage Assessment for Neural Network Side-Channel Leakage Detection

Ján Mikulec (Faculty of Informatics and Information Technologies, Slovak University of Technology, Bratislava, Slovakia, jan.mikulec@stuba.sk), Jakub Breier (TTControl GmbH, Vienna, Austria, jbreier@jbreier.com), Xiaolu Hou (Faculty of Informatics and Information Technologies, Slovak University of Technology, Bratislava, Slovakia, xiaolu.hou@stuba.sk)

Abstract

Test Vector Leakage Assessment (TVLA) based on Welch's t-test has become a standard tool for detecting side-channel leakage. However, its mean-based nature can limit sensitivity when leakage manifests primarily through higher-order distributional differences. As our experiments show, this property becomes especially crucial when it comes to evaluating neural network implementations. In this work, we propose Anderson–Darling Leakage Assessment (ADLA), a leakage detection framework that applies the two-sample Anderson–Darling test for leakage detection. Unlike TVLA, ADLA tests equality of the full cumulative distribution functions and does not rely on a purely mean-shift model. We evaluate ADLA on a multilayer perceptron (MLP) trained on MNIST and implemented on a ChipWhisperer-Husky evaluation platform. We consider protected implementations employing shuffling and random jitter countermeasures. Our results show that ADLA can provide improved leakage-detection sensitivity in protected implementations for a low number of traces compared to TVLA.

1 Introduction

The deployment of neural networks on embedded and edge platforms has accelerated rapidly, driven by applications ranging from vision and biometrics to industrial monitoring and automotive control. While these deployments enable low-latency inference close to the data source, they also expose implementations to physical attacks [1].
In particular, side-channel analysis (SCA) [2] can exploit data-dependent variations in power consumption or electromagnetic emanations to infer sensitive information about intermediate computations, model parameters, or user inputs. As a result, practical countermeasures such as masking [11], shuffling [25], and jitter-based techniques [5] are increasingly considered when implementing machine-learning workloads on constrained devices.

A widely adopted first step in evaluating side-channel resistance is leakage assessment, which aims to determine whether data-dependent leakage is present without committing to a specific attack strategy [16]. The de facto standard methodology is Test Vector Leakage Assessment (TVLA) [12]. TVLA is attractive due to its simplicity and its well-established thresholding practice; however, it is fundamentally a mean-based test. When countermeasures reduce or hide mean shifts (e.g., through shuffling or desynchronization), leakage may persist in the form of higher-order distributional differences that are less visible to a purely mean-sensitive statistic. This motivates the development of complementary leakage assessment techniques that can detect discrepancies beyond the first moment.

In this work, we propose Anderson–Darling Leakage Assessment (ADLA), a leakage detection framework that leverages the two-sample Anderson–Darling test to compare leakage distributions arising from two controlled input conditions. In contrast to TVLA, which tests the equality of means, ADLA evaluates whether the two distributions share the same cumulative distribution function (CDF), thereby providing sensitivity to a broader class of leakage effects. We further derive an explicit decision threshold for ADLA to enable practical use in evaluation workflows.

We validate ADLA on a neural-network inference implementation measured on a ChipWhisperer-based side-channel acquisition setup. We evaluate protected implementations employing shuffling and jitter.
Our experiments show that ADLA detects leakage with substantially fewer traces than TVLA in this setting, including cases where TVLA remains below its detection threshold. These results indicate that distribution-sensitive assessment can be particularly valuable when countermeasures reduce mean-based leakage signatures.

Contributions. This paper makes the following contributions:
- We introduce ADLA, a leakage assessment method for neural-network implementations based on the two-sample Anderson–Darling test.
- We derive a detection threshold for ADLA, enabling decision-making with a practical significance level to observe leakage.
- We experimentally demonstrate that ADLA provides higher leakage-detection sensitivity than TVLA at low trace counts on protected implementations, improving time efficiency in practical evaluation campaigns.

Practical relevance. From an evaluation perspective, improved sensitivity at low trace counts directly translates into shorter acquisition campaigns and reduced measurement cost. This is particularly beneficial for certification and testing laboratories, where throughput and time-to-result are critical and collecting very large trace sets may be impractical.

2 Related Work

In this section, we review the background relevant to our work. We first introduce neural networks (Subsection 2.1), followed by an overview of side-channel analysis attacks on neural network implementations (Subsection 2.2). We then discuss existing countermeasures (Subsection 2.3) and conclude with a review of leakage assessment methodologies (Subsection 2.4).

2.1 Neural Networks

Neural networks [13] are computational models composed of layers of interconnected neurons, whose behavior is governed by trainable parameters, typically weights and biases. During inference, these parameters, together with the chosen activation functions, determine the sequence of linear transformations and nonlinear mappings that produce the network output.
A Multilayer Perceptron (MLP) is a fundamental class of feedforward neural networks consisting of an input layer, one or more hidden layers, and an output layer. Each neuron computes a weighted sum of its inputs, adds a bias term, and applies a nonlinear activation function. The network parameters are typically optimized via backpropagation [27] combined with gradient-based learning algorithms to minimize a task-specific loss function. In a standard fully connected MLP architecture, every neuron in a given layer is connected to all neurons in the subsequent layer. This dense connectivity enables the network to approximate nonlinear functions and makes MLPs suitable for a wide range of classification and regression tasks.

2.2 Side-channel Analysis Attacks on Neural Networks

Side-channel analysis (SCA) attacks on neural network implementations typically assume a black-box setting in which the network architecture and model parameters are secret. The adversary observes physical side-channel information, so-called side-channel leakages, such as timing behavior, power consumption, or electromagnetic (EM) emissions, during inference or training to gain information about the neural network. For example, differences in activation-function execution time may reveal the type of activation function used [2], power/EM side-channel information can leak sensitive input information [3], or expose internal architectural features such as layer types [33]. Of particular relevance to our work is the correlation power analysis (CPA) attack [2, 25, 18], in which statistical correlation is computed between hypothetical intermediate values (derived from candidate secret parameters) and measured side-channel leakages. By identifying the parameter hypothesis that maximizes the statistical dependence with the observed leakage, an attacker can recover secret model parameters.
Such attacks fundamentally rely on the data-dependency of physical leakages and the statistical distinguishability of the resulting distributions.

2.3 Countermeasures Against Side-Channel Analysis Attacks

Several countermeasures have been proposed to mitigate SCA attacks on neural network implementations. Desynchronization-based techniques [5] introduce jitter into the computations to randomize execution time and reduce the effectiveness of timing-based attacks. Masking approaches [11] randomize intermediate computations to decrease the statistical dependence between the measured leakage and sensitive variables, and have been demonstrated for neural networks with integer weights to hinder correlation power analysis (CPA). However, applying masking across an entire network typically incurs substantial computational and implementation overhead. In this work, we adopt two countermeasures against CPA attacks: shuffling and random jitter. Shuffling randomizes the order of multiplications within a layer to disrupt CPA [22], with subsequent work [25] protecting the shuffled index generation mechanism itself against SCA [6]. Random jitter introduces random delays into the computation to desynchronize side-channel measurements [8]. While well-studied in cryptographic implementations, its application to neural networks has mainly addressed timing-based attacks [5], and its impact on power-based CPA remains less explored.

2.4 Leakage Assessment

Leakage assessment was originally developed for evaluating the side-channel security of cryptographic implementations. From a developer's perspective, it provides a systematic methodology to determine whether an implementation exhibits detectable data-dependent leakage, without requiring knowledge of a specific attack strategy. As new side-channel attacks continue to emerge, it is generally impractical to validate resistance against each attack individually.
Leakage assessment addresses this challenge by analyzing measured side-channel leakages and determining whether statistically significant data-dependent information is present [15]. Among the proposed methodologies, Test Vector Leakage Assessment (TVLA) [12] has become the de facto standard for evaluating cryptographic implementations. TVLA employs statistical hypothesis testing to detect leakage. More recently, TVLA has also been adopted to evaluate the side-channel security of neural network implementations. To the best of our knowledge, TVLA remains the only established leakage assessment methodology currently applied to neural network implementations. Further details on TVLA and its statistical foundations are provided in Subsections 3.2 and 4.1. Beyond mean-based TVLA, side-channel leakage can also be evaluated using χ²-type tests [20]. In our setting, however, these tests were shown to be highly sensitive to the discretization of the measurement space, particularly the choice of binning, thus making them hard to interpret.

3 Statistical Hypothesis Testing

This section introduces the notation and basic concepts of statistical hypothesis testing (Subsection 3.1), reviews Welch's t-test underlying TVLA (Subsection 3.2), and describes the two-sample Anderson–Darling test on which our ADLA framework is based (Subsection 3.3).

3.1 Notation and Preliminaries

A statistical hypothesis [26] is a formal statement concerning one or more unknown parameters of the underlying probability distribution(s) governing the data. It is termed a hypothesis because its validity is not known a priori and must be assessed based on observed data. Statistical hypothesis testing is a methodological framework that uses sample data to evaluate such statements. More precisely, it provides a decision rule for determining whether the observed sample is consistent with a specified hypothesis about the underlying data-generating distribution(s).
Based on the outcome of the test, the hypothesis is either rejected or not rejected. Importantly, failing to reject a hypothesis does not imply that it is true; rather, it indicates that the observed data do not provide sufficient evidence against it. The hypothesis under investigation is referred to as the null hypothesis, denoted by H_0. It is tested against a competing statement called the alternative hypothesis, denoted by H_1. The performance of the test is characterized by its significance level, denoted by α, which is defined as an upper bound on the probability of rejecting H_0 when H_0 is true (Type I error). For a given choice of α, a critical region (or equivalently, a decision threshold) is determined according to the distribution of the test statistic under H_0.

In this paper, we focus on the comparison of two probability distributions. Let X and Y denote the corresponding random variables. Let X_1, X_2, …, X_n and Y_1, Y_2, …, Y_n be independent samples drawn from the distributions of X and Y, respectively. (For simplicity, we assume that both samples have the same size n. This assumption is justified in the context of side-channel measurements, where it is typically feasible to collect an equal number of traces under different experimental conditions.) A test statistic is computed from the observed samples. If the value of this statistic falls within the critical region (equivalently, exceeds the predefined threshold), the null hypothesis H_0 is rejected; otherwise, it is not rejected.

3.2 Welch's t-test

Welch's t-test [32] is a parametric test designed to compare the means of two normal distributions. Let X ∼ N(μ_x, σ_x²) and Y ∼ N(μ_y, σ_y²) denote two independent random variables corresponding to the distributions under consideration. The null and alternative hypotheses are defined as

H_0: μ_x = μ_y,  H_1: μ_x ≠ μ_y.
The test statistic is given by

t := (X̄ − Ȳ) / √(S_x²/n + S_y²/n),  (1)

where X̄ and Ȳ denote the sample means, and S_x² and S_y² denote the unbiased sample variances of the respective samples. Under the null hypothesis, the statistic t approximately follows a t-distribution. For sufficiently large sample sizes, the distribution of t converges to the standard normal distribution by the central limit theorem. In this asymptotic regime, the threshold corresponding to a significance level α is z_{α/2}, defined by Φ(z_{α/2}) = 1 − α/2, where Φ denotes the cumulative distribution function of the standard normal distribution. Equivalently,

α/2 = 1 − Φ(z_{α/2}).  (2)

The null hypothesis is rejected if |t| > z_{α/2}. In this case, at significance level α, the observed data provide sufficient statistical evidence to reject H_0 in favor of H_1, indicating that the population means differ.

3.3 Two-sample Anderson–Darling Test

The two-sample Anderson–Darling test [23] is a nonparametric procedure for testing whether two independent samples originate from the same (continuous) distribution. In contrast to mean-based tests, it compares the entire distributions via their cumulative distribution functions. Let F_x and F_y denote the cumulative distribution functions (CDFs) of the random variables X and Y, respectively. The null and alternative hypotheses are

H_0: F_x = F_y,  H_1: F_x ≠ F_y.

Let X_1, …, X_n and Y_1, …, Y_n be two independent samples of equal size n. Consider the pooled sample of size 2n, arranged in increasing order,

Z_(1) ≤ Z_(2) ≤ ⋯ ≤ Z_(2n).

The pooled sample consists of all observations from both samples, ordered increasingly while retaining information about their sample of origin.
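To make the decision rule concrete, Eq. (1) and the standard TVLA threshold can be sketched in a few lines of plain Python; the sample values below are illustrative and are not taken from the paper's measurements.

```python
import math

def welch_t(x, y):
    """Welch's t-statistic for two equal-size samples (Eq. 1):
    t = (mean(x) - mean(y)) / sqrt(s_x^2/n + s_y^2/n),
    with unbiased sample variances s_x^2 and s_y^2."""
    n = len(x)
    assert n == len(y) and n > 1
    mx, my = sum(x) / n, sum(y) / n
    sx2 = sum((v - mx) ** 2 for v in x) / (n - 1)
    sy2 = sum((v - my) ** 2 for v in y) / (n - 1)
    return (mx - my) / math.sqrt(sx2 / n + sy2 / n)

# TVLA-style decision at the standard threshold tau_t = 4.5.
t = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
leak = abs(t) > 4.5
```

With real traces, this computation is repeated independently at every time sample, as described in Section 5.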
For each i ∈ {1, …, 2n−1}, let M_i denote the number of observations among X_1, …, X_n that are less than or equal to Z_(i). The Anderson–Darling test statistic is defined as

A² := (1/n²) Σ_{i=1}^{2n−1} (2nM_i − ni)² / (i(2n−i)).  (3)

(The square in the notation is historical and reflects the fact that the statistic is a quadratic functional of the empirical process.) Under the null hypothesis and assuming continuity of the common distribution, the statistic A² converges in distribution, as n → ∞, to a nondegenerate limiting distribution. In the two-sample case this limiting distribution can be expressed as [28]

A²_∞ := Σ_{j=1}^{∞} W_j / (j(j+1)),  (4)

where {W_j}_{j≥1} are independent chi-square random variables with one degree of freedom. Since the limiting distribution does not admit a closed-form expression, thresholds corresponding to a prescribed significance level α are obtained from tabulated asymptotic percentiles or via numerical approximation. In particular, Scholz and Stephens [28] computed approximate percentiles by matching the first four moments of the limiting distribution and fitting a Pearson curve, following the methodology of Stephens [31] and Solomon and Stephens [30]. This approach has been shown to provide accurate approximations.

4 Anderson–Darling Leakage Assessment

This section introduces the ADLA framework for detecting secret-dependent leakage in neural network implementations. Subsection 4.1 formulates leakage detection as a statistical hypothesis test and connects it to TVLA based on Welch's t-test (Subsection 3.2). Subsection 3.3 motivates the two-sample Anderson–Darling test as a distribution-sensitive alternative to mean-based methods, while Subsection 4.2 derives the ADLA threshold.
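Eq. (3) translates almost directly into code. The sketch below is a plain-Python transcription of the statistic as defined above; ties in the pooled sample are resolved by value comparison, while the test formally assumes continuous distributions.

```python
def ad2(x, y):
    """Two-sample Anderson-Darling statistic of Eq. (3) for equal-size
    samples: A^2 = (1/n^2) * sum_{i=1}^{2n-1} (2n*M_i - n*i)^2 / (i*(2n-i)),
    where M_i counts observations from x that are <= the i-th smallest
    value Z_(i) of the pooled sample."""
    n = len(x)
    assert n == len(y)
    pooled = sorted(x + y)
    xs = sorted(x)
    total, m = 0.0, 0
    for i in range(1, 2 * n):        # i = 1, ..., 2n-1
        z = pooled[i - 1]            # Z_(i)
        while m < n and xs[m] <= z:  # advance M_i = #{x_k <= Z_(i)}
            m += 1
        total += (2 * n * m - n * i) ** 2 / (i * (2 * n - i))
    return total / n ** 2

# ADLA-style decision against the threshold derived in Subsection 4.2.
a2 = ad2([1.0, 3.0], [2.0, 4.0])
leak = a2 > 11.99
```

For the tiny example above the statistic evaluates to 2/3, far below the threshold; real evaluations apply this per time sample across both trace sets.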
4.1 Leakage Detection as a Statistical Hypothesis Test As discussed in Section 2, SCAs targeting neural networks aim to recover secret parameters by exploiting statistical dependencies between side-channel observations and the secret values. A leakage assessment method in this setting aims to determine whether such secret-dependent leakage is present. To formalize leakage detection within the framework of statistical hypothesis testing, we model side-channel measurements as realizations of random variables. Under the null hypothesis of no secret-dependent leakage, the distribution of the measured leakage should be independent of the secret value. Consequently, the leakage distributions corresponding to different secret-dependent conditions should be identical. In practice, leakage detection is typically conducted by collecting two sets of measurements: one obtained under a fixed input and another obtained under a different fixed input. In the context of neural networks, we consider a specific input neuron corresponding to the secret parameter under investigation. For example, when evaluating potential leakage associated with the first weight in the first hidden layer, the value of the corresponding input neuron is varied while all other inputs are kept constant. This setting reflects a realistic attack scenario in which an attacker controls a chosen input neuron in order to induce variations in the intermediate computation involving the secret weight, thereby potentially amplifying secret-dependent leakage. Under the null hypothesis of no data-dependent leakage, a necessary condition is that the distributions of side-channel leakages collected under the two different fixed input configurations are identical. If a statistically significant difference between these distributions is observed, the null hypothesis is rejected, indicating the presence of data-dependent leakage. 
The TVLA methodology evaluates leakage by testing equality of means using Welch's t-test under an approximate normality assumption [15]. A significant difference in sample means implies a difference in the underlying distributions and thus indicates data-dependent leakage. However, failure to reject the null hypothesis does not imply the absence of leakage – it merely indicates that no statistically significant mean difference was detected. The motivation for employing the two-sample Anderson–Darling test (AD test) in our setting is that it compares the entire distributions rather than only their means. This allows detection of more general forms of distributional differences, including variance or tail discrepancies, which may not be captured by mean-based tests.

4.2 Derivation of the ADLA Threshold

In standard TVLA practice, the detection threshold is set to τ_t := 4.5 [12, 10]. That is, after computing the test statistic t (cf. Eq. 1), leakage is declared if |t| > τ_t = 4.5. Under the asymptotic normal approximation (cf. Eq. 2), this threshold corresponds to a two-sided significance level of approximately α ≈ 3.4 × 10⁻⁶, providing a highly conservative criterion for leakage detection [15]. To ensure direct comparability with the standard TVLA methodology, we adopt the same significance level α in our proposed Anderson–Darling Leakage Assessment (ADLA) framework. As discussed in Subsection 3.3, the limiting distribution of the two-sample Anderson–Darling statistic does not admit a closed-form expression. Consequently, the corresponding critical value must be determined numerically. We denote the threshold for ADLA by τ_A, defined as the upper (1−α)-quantile of the limiting distribution:

Pr(A²_∞ > τ_A) = α ≈ 3.4 × 10⁻⁶.

To approximate τ_A, we employ the Pearson curve fitting method [28].
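The correspondence between the threshold 4.5 and the quoted significance level can be checked with the standard-normal CDF (cf. Eq. 2); this small sketch uses only the error function from the standard library.

```python
import math

def phi(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Tail probability beyond the TVLA threshold tau_t = 4.5,
# i.e. 1 - Phi(4.5), which is approximately 3.4e-6.
alpha = 1.0 - phi(4.5)
```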
Using the additivity and scaling properties of cumulants for independent random variables [17] and Eq. 4, the r-th cumulant of A²_∞ is given by

κ_r = 2^{r−1} (r−1)! Σ_{j=1}^{∞} 1/(j(j+1))^r.

The infinite series can be evaluated in closed form via partial fraction decomposition [14]. For r = 1, 2, 3, 4, we obtain

Σ_{j=1}^{∞} 1/(j(j+1))^r = 1 (r = 1);  π²/3 − 3 (r = 2);  10 − π² (r = 3);  π⁴/45 + 10π²/3 − 35 (r = 4).

Hence, the first four cumulants of A²_∞ are

κ₁ = 1,  κ₂ = 2π²/3 − 6,  κ₃ = 80 − 8π²,  κ₄ = 16π⁴/15 + 160π² − 1680.

Using the standard relations between cumulants and central moments [19], the first four moments satisfy

μ₁ = κ₁,  μ₂ = κ₂,  μ₃ = κ₃,  μ₄ = κ₄ + 3κ₂².

The skewness and kurtosis are therefore given by [29]

γ₁ = μ₃ / μ₂^{3/2},  γ₂ = μ₄ / μ₂².

μ₁, μ₂, γ₁, and γ₂ uniquely determine a member of the Pearson system, which we use to approximate the limiting distribution underlying ADLA. The Pearson curve fitting is performed using the PearsonDS package in R [4]. For α = 3.4 × 10⁻⁶, the resulting ADLA threshold value is

τ_A ≈ 11.99.

Figure 1: Normalized TVLA and ADLA statistics versus the number of traces n for three fixed input-value pairs in the protected implementation.

5 Experimental Evaluation

In this section, we first describe the experimental setup in Subsection 5.1 and subsequently present and discuss the evaluation results in Subsection 5.2.

Figure 2: TVLA and ADLA statistics, (a) |t| and (b) A², evaluated at each time sample for n = 850 traces in the protected implementation.
5.1 Experimental Setup

To evaluate the proposed ADLA framework, we trained a multilayer perceptron (MLP) on the MNIST handwritten digit dataset [9] and collected side-channel power measurements using the ChipWhisperer-Husky platform [21]. MNIST comprises grayscale images of handwritten digits (0–9), where each sample is represented as a 28×28 pixel array. The dataset contains 60,000 training samples and 10,000 test samples and is widely used as a benchmark for image classification. Prior to training and evaluation, pixel intensities were normalized to the range [0,1]. The evaluated MLP consists of an input layer, three fully connected hidden layers, and an output layer with 784, 256, 128, 64, and 10 neurons, respectively. The hidden layers employ rectified linear unit (ReLU) activations, and the output layer uses a softmax activation. All computations were performed in 32-bit floating-point arithmetic. The trained network achieves 97.03% classification accuracy on the MNIST test set.

Side-channel measurements were acquired using a ChipWhisperer-Husky capture device connected to a CW313 target board equipped with an Atmel SAM4S (ARM Cortex-M4) microcontroller as the device under test (DUT). The DUT was clocked at 7.3728 MHz, and the ADC sampling rate was set to 4× that frequency. During inference, the device power consumption was recorded as time-series traces; each trace contains 120,000 samples. The evaluated implementation combines the shuffling countermeasure proposed in [25] with random jitter to further desynchronize the measured traces. Shuffling randomizes the execution order of multiplications, while jitter inserts a pseudo-random delay immediately before each multiplication. The delay is implemented as a bounded busy-wait loop with a pseudo-random iteration count (e.g., rand() & 127) and is guarded against compiler optimization using a volatile sink and a memory barrier.
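As a rough illustration of the evaluated architecture (784-256-128-64-10 with ReLU hidden layers and a softmax output, in 32-bit floats), here is a forward-pass sketch; the weights are random placeholders, since the trained parameters are not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 64, 10]  # layer widths from the paper

# Random placeholder parameters (NOT the trained model from the paper).
weights = [rng.standard_normal((m, k)).astype(np.float32) * 0.05
           for m, k in zip(sizes[1:], sizes[:-1])]
biases = [np.zeros(m, dtype=np.float32) for m in sizes[1:]]

def forward(x):
    """Inference pass: ReLU on the hidden layers, softmax on the output."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(w @ a + b, 0.0)   # ReLU
    z = weights[-1] @ a + biases[-1]
    e = np.exp(z - z.max())              # numerically stable softmax
    return e / e.sum()

probs = forward(rng.random(784, dtype=np.float32))
```

In the paper's measurements only the first hidden neuron's multiply–accumulate computation is executed on the target, so this full forward pass is purely illustrative.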
Following the leakage assessment methodology described in Subsection 4.1, we target the first network weight. Two sets of measurements were generated by applying two distinct values to the first input neuron, while keeping all remaining input neurons fixed across all captures. This procedure yields two sets of traces corresponding to two experimental conditions that induce different intermediate computations involving the targeted weight. Leakage assessment was then performed independently at each time sample by comparing the empirical distributions of the two trace sets. Under the TVLA methodology, the null hypothesis assumes equal means for the leakage distributions associated with the two input conditions. In contrast, the proposed ADLA framework tests whether the two leakage distributions share the same cumulative distribution function (CDF). Due to memory constraints on the target device, only the computation of the first hidden neuron was implemented and executed during the measurements. Since the targeted weight directly contributes to this computation through the associated multiply–accumulate operations, this restricted implementation remains representative for evaluating potential side-channel leakage. Leakage assessment was conducted on the recorded traces without additional preprocessing.

5.2 Evaluation Results

We performed leakage assessments for varying numbers of traces per experimental condition. Using the notation introduced in Subsections 3.2 and 3.3, let n denote the number of traces in each of the two trace sets. For each selected value of n, we collected two sets of n traces under two fixed values applied to the first input neuron, while keeping all remaining input neurons constant. Specifically, we evaluated three different pairs of fixed values for the first input neuron (all within [0,1]), and fixed the remaining inputs to the pixel values of a randomly selected MNIST image.
For each time sample of the recorded traces, we compared the two trace sets by computing (i) the TVLA statistic, i.e., the absolute Welch t-statistic |t| defined in Eq. (1), and (ii) the ADLA statistic, i.e., the two-sample Anderson–Darling statistic A² defined in Eq. (3). To facilitate a direct comparison between the two methodologies, we report threshold-normalized test statistics obtained by dividing each statistic by its corresponding detection threshold. Specifically, we plot |t|/τ_t and A²/τ_A, where τ_t = 4.5 is used for TVLA and τ_A = 11.99 is used for ADLA. With this normalization, values exceeding 1 indicate rejection of the null hypothesis (i.e., detectable leakage) at the significance level α = 3.4 × 10⁻⁶ adopted throughout this work.

Figure 1 summarizes the normalized TVLA and ADLA results for the protected implementation. To further illustrate the difference between both tests, Figure 2 reports the test statistics for a representative input-pair experiment with n = 850 traces. In this instance, the TVLA statistic |t| remains below the detection threshold across the trace, whereas the ADLA statistic exceeds its threshold at two time samples, indicating statistically significant leakage under ADLA but not under TVLA. Overall, the experimental results indicate that ADLA is more sensitive in this setting, enabling leakage detection with fewer traces than TVLA. This difference is reflected not only in the detection outcome but also in the margin above the decision threshold: whereas |t|/τ_t typically remains close to 1 and exhibits only modest threshold exceedances, A²/τ_A surpasses 1 by a substantially larger factor across the tested input pairs. These observations suggest that the leakage is not primarily characterized by a shift in the mean, but rather by broader distributional differences, which are captured by ADLA and may not be fully reflected by the mean-based TVLA statistic.
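The threshold-normalized comparison can be illustrated on synthetic data: two samples with identical means but different shapes, for which the mean-based statistic stays below its threshold while the distribution-based one exceeds it. The samples below are artificial and only mimic the qualitative effect; they are not measured traces.

```python
import math

def welch_t(x, y):
    # Welch's t-statistic (Eq. 1), equal sample sizes.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx2 = sum((v - mx) ** 2 for v in x) / (n - 1)
    sy2 = sum((v - my) ** 2 for v in y) / (n - 1)
    return (mx - my) / math.sqrt(sx2 / n + sy2 / n)

def ad2(x, y):
    # Two-sample Anderson-Darling statistic (Eq. 3).
    n = len(x)
    pooled, xs = sorted(x + y), sorted(x)
    total, m = 0.0, 0
    for i in range(1, 2 * n):
        while m < n and xs[m] <= pooled[i - 1]:
            m += 1
        total += (2 * n * m - n * i) ** 2 / (i * (2 * n - i))
    return total / n ** 2

# Two synthetic leakage samples with equal means (0.5) but different shapes.
x = [0.0, 1.0] * 50   # bimodal
y = [0.4, 0.6] * 50   # concentrated

norm_tvla = abs(welch_t(x, y)) / 4.5   # |t| / tau_t, stays below 1
norm_adla = ad2(x, y) / 11.99          # A^2 / tau_A, exceeds 1
```

This reproduces in miniature the behavior described above: no mean shift, so TVLA is blind, yet the CDFs differ strongly and ADLA flags the difference.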
To assess whether the leakage samples follow a Gaussian distribution – an assumption adopted when interpreting Welch's t-test (Subsection 3.2) in the TVLA methodology – we further employ a quantile–quantile (Q–Q) plot [7] with respect to the normal distribution. For a fixed time sample, the measured leakage values are sorted to obtain empirical quantiles and plotted against the corresponding theoretical quantiles of a standard normal distribution. If the leakage were normally distributed (up to an affine transformation), the plotted points would lie approximately on a straight line (the normal reference line).

Figure 3: Q–Q plot of leakage samples at time sample t = 1316, obtained from a dataset of n = 1000 traces corresponding to a fixed input value.

Figure 3 shows the Q–Q plot constructed from a dataset of n = 1000 traces at time sample t = 1316, using measurements collected under a single fixed input configuration. The time sample t = 1316 corresponds to the highest peak observed in Fig. 2 (b). The pronounced deviation of the empirical quantiles from the normal reference line confirms that the leakage distribution at this time sample is not Gaussian.

6 Conclusion and Future Work

We introduced and evaluated the ADLA framework as a distribution-sensitive alternative to TVLA. For the shuffling- and jitter-protected implementation considered in this work, ADLA consistently detected leakage at relatively low trace counts, including cases where TVLA remained below its detection threshold. These results suggest that, in this setting, leakage is not necessarily dominated by mean shifts, but can instead arise from broader distributional differences that are captured by ADLA.

χ²-based leakage tests. In addition to mean-based t-tests and distribution-based tests such as ADLA, side-channel leakage can also be assessed using χ²-type tests, as discussed in [20].
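The Q–Q construction described above can be sketched without a plotting library by pairing sorted samples with standard-normal quantiles. The plotting positions (i − 0.5)/n used below are a common convention assumed here, not a detail specified in the paper.

```python
from statistics import NormalDist

def qq_points(samples):
    """Return (theoretical, empirical) quantile pairs for a normal Q-Q plot,
    using the plotting positions p_i = (i - 0.5)/n."""
    n = len(samples)
    emp = sorted(samples)
    theo = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theo, emp))

# Sanity check: for data that are exactly standard-normal quantiles,
# the points fall on the reference line y = x.
ideal = [NormalDist().inv_cdf((i - 0.5) / 100) for i in range(1, 101)]
pts = qq_points(ideal)
```

With measured traces, `samples` would be the leakage values at one fixed time sample (e.g., t = 1316 in Figure 3), optionally standardized before comparison.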
In our setting, however, we observed that χ²-based results are highly sensitive to the discretization of the measurement space, i.e., to the choice of the number of bins and bin boundaries used to form the contingency table. Designing a robust χ²-based test for this setting, including principled binning strategies and stability analyses across noise regimes, is therefore an interesting direction for future work.

Higher-order power analysis. As another direction for future work, we will investigate whether the leakage points detected by ADLA can be exploited to reveal secret weights. While CPA is ineffective against shuffling [25], the leakage detected by ADLA motivates investigating stronger adversaries. Future work will therefore consider higher-order attacks and statistical analyses [24] to assess exploitability.

References

[1] L. Batina, S. Bhasin, J. Breier, X. Hou, and D. Jap (2022) On implementation-level security of edge-based machine learning models. In Security and Artificial Intelligence: A Crossdisciplinary Approach, p. 335–359. Cited by: §1.
[2] L. Batina, S. Bhasin, D. Jap, and S. Picek (2019) CSI NN: reverse engineering of neural network architectures through electromagnetic side channel. In 28th USENIX Security Symposium (USENIX Security 19), p. 515–532. Cited by: §1, §2.2, §2.2.
[3] L. Batina, S. Bhasin, D. Jap, and S. Picek (2019) Poster: recovering the input of neural networks via single shot side-channel attacks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, p. 2657–2659. Cited by: §2.2.
[4] M. Becker and S. Klößner (2025) PearsonDS: Pearson distribution system. R package version 1.3.2. Cited by: §4.2.
[5] J. Breier, D. Jap, X. Hou, and S. Bhasin (2023) A desynchronization-based countermeasure against side-channel analysis of neural networks. In International Symposium on Cyber Security, Cryptology, and Machine Learning, p. 296–306. Cited by: §1, §2.3, §2.3.
[6] M. Brosch, M. Probst, and G. Sigl (2022) Counteract side-channel analysis of neural networks by shuffling. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), p. 1305–1310. Cited by: §2.3.
[7] J. M. Chambers (2018) Graphical methods for data analysis. Chapman and Hall/CRC. Cited by: §5.2.
[8] J. Coron and I. Kizhvatov (2009) An efficient method for random delay generation in embedded software. In Cryptographic Hardware and Embedded Systems – CHES 2009: 11th International Workshop, Lausanne, Switzerland, September 6–9, 2009, Proceedings, p. 156–170. Cited by: §2.3.
[9] L. Deng (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29 (6), p. 141–142. Cited by: §5.1.
[10] A. A. Ding, L. Zhang, F. Durvaux, F. Standaert, and Y. Fei (2018) Towards sound and optimal leakage detection procedure. In Smart Card Research and Advanced Applications: 16th International Conference, CARDIS 2017, Lugano, Switzerland, November 13–15, 2017, Revised Selected Papers, p. 105–122. Cited by: §4.2.
[11] A. Dubey, A. Ahmad, M. A. Pasha, R. Cammarota, and A. Aysu (2022) ModuloNET: neural networks meet modular arithmetic for efficient hardware masking. IACR Transactions on Cryptographic Hardware and Embedded Systems, p. 506–556. Cited by: §1, §2.3.
[12] G. Goodwill, B. Jun, J. Jaffe, and P. Rohatgi (2011) A testing methodology for side-channel resistance validation. In NIST Non-Invasive Attack Testing Workshop, Vol. 7, p. 115–136. Cited by: §1, §2.4, §4.2.
[13] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. Vol. 1, MIT Press, Cambridge. Cited by: §2.1.
[14] M. Hata (2007) Problems and solutions in real analysis. Series on Number Theory and Its Applications, World Scientific. ISBN 9789812776013, LCCN 2008295629. Cited by: §4.2.
[15] X. Hou and J. Breier (2024) Cryptography and embedded systems security. Springer. Cited by: §2.4, §4.1, §4.2.
[16] Cited by: §1.
[17] J. E. Kolassa (2006) Series approximation methods in statistics. Springer. Cited by: §4.2.
[18] Z. Lehocký, J. Breier, D. Jap, S. Bhasin, and X. Hou (2025) Side-channel analysis of OpenVINO-based neural network models. In International Conference on Availability, Reliability and Security, p. 307–324. Cited by: §2.2.
[19] R. C. Mittelhammer (2013) Mathematical statistics for economics and business. Springer. Cited by: §4.2.
[20] A. Moradi, B. Richter, T. Schneider, and F. Standaert (2018) Leakage detection with the χ²-test. IACR Transactions on Cryptographic Hardware and Embedded Systems 2018 (1), p. 209–237. Cited by: §2.4, §6.
[21] NewAE Technology Inc. (2025) ChipWhisperer-Husky and Husky Plus user manual. NewAE Technology Inc. Cited by: §5.1.
[22] Y. Nozaki and M. Yoshikawa (2021) Shuffling countermeasure against power side-channel attack for MLP with software implementation. In 2021 IEEE 4th International Conference on Electronics and Communication Engineering (ICECE), p. 39–42. Cited by: §2.3.
[23] A. Pettitt (1976) A two-sample Anderson–Darling rank statistic. Biometrika, p. 161–168. Cited by: §3.3.
[24] E. Prouff, M. Rivain, and R. Bevan (2009) Statistical analysis of second order differential power analysis. IEEE Transactions on Computers 58 (6), p. 799–811. Cited by: §6.
[25] L. Puškáč, M. Benovič, J. Breier, and X. Hou (2025) Make shuffling great again: a side-channel-resistant Fisher–Yates algorithm for protecting neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Cited by: §1, §2.2, §2.3, §5.1, §6.
[26] S. M. Ross (2020) Introduction to probability and statistics for engineers and scientists. Academic Press. Cited by: §3.1.
[27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. Nature 323 (6088), p. 533–536. Cited by: §2.1.
[28] F. Scholz and M. A. Stephens (1986) K-sample Anderson–Darling tests of fit, for continuous and discrete cases. University of Washington, Technical Report (81). Cited by: §3.3, §3.3, §4.2.
[29] J. K. Sharma (2012) Business statistics. Pearson Education India. Cited by: §4.2.
[30] H. Solomon and M. A. Stephens (1978) Approximations to density functions using Pearson curves. Journal of the American Statistical Association 73 (361), p. 153–160. Cited by: §3.3.
[31] M. A. Stephens (1976) Asymptotic results for goodness-of-fit statistics with unknown parameters. The Annals of Statistics, p. 357–369. Cited by: §3.3.
[32] B. L. Welch (1947) The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34 (1–2), p. 28–35. Cited by: §3.2.
[33] X. Yan, X. Lou, G. Xu, H. Qiu, S. Guo, C. H. Chang, and T. Zhang (2023) Mercury: an automated remote side-channel attack to NVIDIA deep learning accelerator. arXiv preprint arXiv:2308.01193. Cited by: §2.2.