Paper deep dive
KANtize: Exploring Low-bit Quantization of Kolmogorov-Arnold Networks for Efficient Inference
Sohaib Errabii, Olivier Sentieys, Marcello Traiola
Abstract
Kolmogorov-Arnold Networks (KANs) have gained attention for their potential to outperform Multi-Layer Perceptrons (MLPs) in terms of parameter efficiency and interpretability. Unlike traditional MLPs, KANs use learnable non-linear activation functions, typically spline functions, expressed as linear combinations of basis splines (B-splines). The B-spline coefficients serve as the model's learnable parameters. However, evaluating these spline functions increases computational complexity during inference. Conventional quantization reduces this complexity by lowering the numerical precision of parameters and activations, but the impact of quantization on KANs, and especially its effectiveness in reducing computational complexity, is largely unexplored, particularly for quantization levels below 8 bits. This study investigates the impact of low-bit quantization on KANs' accuracy, computational complexity, and hardware efficiency. Results show that B-splines can be quantized to 2-3 bits with negligible loss in accuracy, significantly reducing computational complexity. Hence, we investigate the potential of using low-bit quantized precomputed tables as a replacement for the recursive B-spline algorithm, aiming to further reduce the computational complexity of KANs and enhance hardware efficiency while maintaining accuracy. For example, ResKAN18 achieves a 50x reduction in BitOps without loss of accuracy using low-bit-quantized B-spline tables. Additionally, precomputed 8-bit lookup tables improve GPU inference speed by up to 2.9x, while on FPGA-based systolic-array accelerators, reducing B-spline table precision from 8 to 3 bits cuts resource usage by 36%, increases clock frequency by 50%, and yields a 1.24x speedup. On a 28nm FD-SOI ASIC, reducing the B-spline bit-width from 16 to 3 bits achieves 72% area reduction and 50% higher maximum frequency.
Links
- Source: https://arxiv.org/abs/2603.17230v1
Full Text
KANtize: Exploring Low-bit Quantization of Kolmogorov-Arnold Networks for Efficient Inference Sohaib Errabii, Olivier Sentieys, Marcello Traiola Univ Rennes, Inria, CNRS, IRISA, Rennes, France Abstract Kolmogorov-Arnold Networks (KANs) have garnered significant interest due to their potential for superior parameter efficiency and interpretability compared to Multi-Layer Perceptrons (MLPs). Their key innovation is the use of learnable non-linear activation functions, rather than traditional learnable linear weights. Those functions are typically parametrized as spline functions, often expressed as linear combinations of basis splines (B-splines). The B-spline coefficients are the model’s new learnable parameters. However, spline function evaluation significantly increases inference computational complexity. One conventional way to reduce such complexity is quantization, which reduces the numerical precision of parameters and activations. The impact of quantization on KANs, and especially its effectiveness in reducing computational complexity, is largely unexplored, particularly for quantization levels below 8 bits. In this work, we examine KANs’ behavior when using existing standard post-training weight and activation quantization and its impact on computational complexity and hardware efficiency. We find that B-splines are highly robust to quantization and yield substantial reductions in computational complexity, whereas the learnable parameters are more sensitive to quantization. Hence, we investigate the potential of using low-bit quantized precomputed tables as a replacement for the recursive B-spline algorithm. This approach aims to further reduce the computational complexity of KANs and enhance hardware efficiency while maintaining accuracy. We performed experiments on six state-of-the-art KAN models, including both MLP-based and convolution-based models. 
Our results show that B-spline inputs/outputs can be quantized to 2-3 bits with negligible drop in accuracy, while substantially reducing computational complexity. For example, for ResKAN18, we show that a BitOps reduction of more than 50× is possible without loss of accuracy, thanks to low-bit quantized B-spline tabulation. Furthermore, using pre-computed 8-bit lookup tables enables up to 2.9× inference speedup on GPUs, while on a systolic-array-based KAN accelerator, reducing the B-spline table precision from 8 to 3 bits on FPGA yields up to 36% resource reduction, 50% higher clock frequency and 1.24× inference speedup. On a 28nm FD-SOI ASIC, reducing the B-spline bit-width from 16 to 3 bits achieves 72% area reduction and 50% higher maximum frequency.

I Introduction

Kolmogorov-Arnold Networks (KANs) are a new Deep Neural Network (DNN) architecture originally designed as an alternative to Multi-Layer Perceptrons (MLPs) [20]. KANs have gained significant attention and are now used in a range of applications, including time series analysis [28], recommender systems [22], and medical image segmentation [17]. Their main advantages over traditional DNNs are improved parameter efficiency and explainability. The primary characteristic that makes KANs unique is the replacement of conventional edge weights with learnable activation functions. For example, learnable spline functions have been used in the original paper [20]. These splines ϕ can be expressed as a linear combination of basis functions (B-splines) b(·), i.e., ϕ(x) = \sum_i w_i b_i(x). Therefore, in KANs, the learnable parameters are the B-spline coefficients w_i. However, from a computational perspective, adding spline function evaluation to the conventional multiplication significantly increases the computational cost, proportionally to the size of the basis.
Indeed, to compute KAN inference, instead of performing a single scalar multiplication, each basis function must first be evaluated at the input, followed by a linear combination of all basis functions. Moreover, their computation is costly because of the B-spline functions' recursive evaluation (see Section II-A). Therefore, research efforts have been focusing on new approaches to accelerate KANs. Among recent studies, compute-in-memory (CIM) approaches have been proposed [27, 13]. In such studies, different ways to approximate the learned non-linear KAN functions or their basis are utilized, such as piece-wise linear (PWL) approximation [27]. A recent approach [29] focuses on the B-spline evaluation bottleneck. It proposes an efficient dataflow acceleration method for the Cox-de Boor recursive formula, achieving a significant speedup over CPU and GPU implementations.

Another conventional approach to tackling the high computing demands of DNNs is quantization. The quantization process aims at reducing the bit precision of DNN weights and/or activations to low-bit representations (e.g., 8 bits or lower, instead of 32 bits) and at using simpler data types, e.g., fixed-point instead of floating-point. Not only does this reduce the DNN memory requirements, but it also enables the use of simpler and more efficient hardware (e.g., integer arithmetic units instead of floating-point ones) or packing more operands per instruction. In turn, this improves the latency and the overall efficiency. A recent study proposed a co-design strategy for efficiently implementing KANs on RRAM-based analog compute-in-memory (CiM) architectures [13]. In this context, a hardware-aware 8-bit quantization method was proposed to reduce the cost of B-spline evaluation, enabling an efficient 8-bit tabulation of B-splines.
More recently, a novel systolic-array-based accelerator that enables efficient end-to-end inference of KANs via non-recursive tabulated 8-bit B-spline evaluation was proposed, achieving a 2× speedup over conventional systolic arrays [9]. However, quantization can be pushed to even lower precision. A recent preprint studies KAN quantization, evaluating the accuracy loss when low-bit precision is used [10]. Unfortunately, such work neither studies computational complexity nor proposes methods to leverage quantization to achieve practical computational efficiency (e.g., faster inference, increased clock frequency, or optimized resource utilization). Moreover, it considers only the quantization of KAN's activations and coefficients, neglecting the quantization of B-spline outputs, and does not explore the quantization space to assess the impact of quantizing the individual KAN tensor components and all their combinations. A recent paper explores the quantized tabulation of the entire set of KAN learnable spline activation functions and their efficient mapping to FPGA LUT fabrics to reduce computational complexity, avoiding the B-spline basis computation [12]. The work highlights encouraging results at low precision (i.e., 5-8 bits).

In this work, we explore the impact of standard post-training uniform integer quantization on KANs. Beyond conventional weight and activation quantization [10], we also consider quantizing B-spline outputs, which are specific to KAN inference and have not been explored previously. We study the impact of quantizing these three components (weights, activations, and B-spline outputs) both in isolation and jointly. Moreover, we study, for the first time, the beneficial effect of KANs' low-bit quantization on computational complexity by experimenting on GPU and a systolic-array-based KAN hardware accelerator [9]. We show that the different KAN components exhibit distinct sensitivities to quantization.
As also previously highlighted [10], we find that learnable parameters are particularly sensitive, with significant accuracy degradation below 5 bits for most models. Activations are more resilient than coefficients, tolerating 5-bit quantization with limited impact on accuracy. For the first time, we observe that B-spline outputs are the most resilient, tolerating quantization down to 3 bits for most models. Building on these findings, we explore the use of low-bit quantized precomputed tables to replace the recursive B-spline algorithm, significantly reducing KAN computational complexity while maintaining accuracy. While this has been explored for 8-bit quantization [9], to the best of our knowledge, no evaluation of the computational advantages of low-bit quantized precomputed tables in KANs has been performed. Our results show that, on a KAN-SAs-style systolic-array accelerator [9] deployed on FPGA, reducing the B-spline table precision from 8 to 3 bits lowers LUT utilization by 36% at the same array size, enables a 50% higher clock frequency, and yields a measured 1.43× overall speedup while maintaining accuracy. ASIC synthesis in 28nm FD-SOI shows similar trends: halving both B-spline and weight precision from 16 to 8 bits reduces cell area by 60% and raises the maximum frequency by 33%, while further reducing B-spline precision to 3 bits achieves an overall 72% area reduction and 50% higher frequency over the 16-bit baseline. On GPU, using 8-bit B-spline tabulation yields a 2.1-2.9× inference speedup with no loss in accuracy. Finally, we consider the use of low-bit quantized precomputed tables to replace the entire spline computation, as in [12], and highlight the advantages and limitations of this approach compared to using precomputed tables to replace the recursive B-spline evaluation.
Our exploration shows that, due to the nature of the two approaches, replacing spline computation is particularly convenient for models of limited dimension, whereas scalability becomes challenging for larger models.

In summary, the contributions of this paper are:
1. A comprehensive evaluation of how a standard quantization scheme affects MLP-based and convolutional KANs. The sensitivity of the different tensor components of the KAN layers to quantization is investigated.
2. An exploration of how combining different quantization levels across KAN tensor components yields different trade-offs between accuracy and computational complexity.
3. An analysis of the improvements in inference runtime provided by B-spline low-bit quantization and tabulation on GPU and on a systolic-array-based KAN hardware accelerator, and the obtained hardware efficiency improvement when synthesizing the latter to FPGA and ASIC.
4. An analysis of the trade-offs between accuracy and computational complexity achieved through low-bit quantization and tabulation of splines, as well as its limitations when larger models are utilized.

II Background and Motivation

In this section, we summarize the theoretical background of Kolmogorov-Arnold Networks (Section II-A), which essentially involves replacing the weights of conventional neural networks with learnable splines. In Section II-B, we show the implications of this in terms of computational complexity and number of parameters. Finally, in Sections II-C and II-D, we provide background on the methods used in this work, i.e., uniform integer quantization and tabulation of activation functions, respectively.

II-A Kolmogorov-Arnold Networks

The Kolmogorov-Arnold representation theorem states that if f is a multivariate continuous function on a bounded domain, then f can be written as a finite composition of continuous functions of a single variable and the binary operation of addition [15, 2].
More specifically,

f(a) = f(a_1, \dots, a_n) = \sum_{q=0}^{2n} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(a_p) \right),   (1)

where \phi_{q,p} : [0,1] \to \mathbb{R} and \Phi_q : \mathbb{R} \to \mathbb{R}. However, as there is no smoothness guarantee for these 1D functions, which is known to be critical for deep neural networks [25], the Kolmogorov-Arnold representation theorem has not had much success in machine learning. Recently, a generalization of this theorem defined a single layer that can be stacked to construct neural networks of arbitrary width and depth [20]. Under this generalization, the Kolmogorov-Arnold representation theorem becomes a particular case of a 2-layer KAN [n, 2n+1, 1], i.e., the first layer has N_in = n input neurons and N_out = 2n+1 output neurons, and the second one N_in = 2n+1, N_out = 1. This generic KAN layer can be intuitively pictured as a 2-layer MLP (Figure 1(a)), where the weights at the edges are replaced with learnable activation functions, and the fixed activation (e.g., ReLU) at the nodes is simply replaced with a summation, as depicted in Figure 1(b).

Figure 1: A single layer of dimensions [2, 2] for MLP (a) and KAN (b); (c) details of the spline computation as a linear combination of B-splines.

To address the missing guarantee of smoothness, the authors proposed to use parameterized splines as the learnable activations \phi_{q,p}. Splines are often used to interpolate or approximate data points in a smooth and continuous manner. To parameterize these splines, various basis functions can be considered, such as Radial Basis Functions (RBF), Fourier, or B-splines [18]. In this work, we focus on the B-spline basis functions (Figure 1(c)) originally proposed in [20], and widely used in the KAN literature. Briefly, B-splines are evaluated as a function of the input (i.e., b_{i,P}(a)), the results are multiplied by learned coefficients w_i and then summed to obtain the spline value ϕ(a).
In more detail, B-splines are a family of piecewise polynomial functions b_{i,P} defined by their degree P (e.g., P = 3 for cubic splines, as shown in Figure 2) and a knot vector t, which specifies the points where the polynomial segments connect. These functions can be computed by means of the Cox-de Boor recursion formula [7] as

b_{i,P}(a) = \frac{a - t_i}{t_{i+P} - t_i} b_{i,P-1}(a) + \frac{t_{i+P+1} - a}{t_{i+P+1} - t_{i+1}} b_{i+1,P-1}(a)   (2)

with

b_{i,0}(a) = \begin{cases} 1 & \text{if } t_i \le a < t_{i+1} \\ 0 & \text{otherwise.} \end{cases}   (3)

The suitability of B-splines as a basis for spline functions is due to a fundamental theorem from Curry and Schoenberg [6, 5], which states that any spline ϕ of degree P can be expressed as a linear combination of B-splines, i.e., they form a basis for the spline space:

\phi_{i,j}(a) = \sum_k b_{k,P}(a) \, w_{i,k,j}   (4)

B-spline basis functions have local support. Indeed, from the Cox-de Boor formula, we can derive that the function b_{i,P} is only non-zero in the interval [t_i, t_{i+P+1}]. The grid t is defined by first discretizing the input domain (the range of values taken by the KAN layer inputs) into G intervals. The grid is then extended by P points on both sides in order to account for all non-zero B-splines within the input domain. The grid has G + 2P + 1 points, [t_0, t_1, t_2, ..., t_{G+2P}], within which G + P non-zero B-splines are needed to evaluate the spline function according to Eq. 4. As an example, Figure 2 reports B-spline basis functions of degree P = 3. The input domain is divided into G = 3 intervals, leading to G + 2P + 1 = 10 grid points, within which G + P = 6 B-splines form a basis for the spline function evaluation. Each B-spline b_{k,3} is non-zero in the interval [t_k, t_{k+4}].

Figure 2: B-spline basis for G = 3, P = 3 and G + 2P + 1 = 10 grid points [t_0, ..., t_9].

Finally, as reported in Eq.
4 and sketched in Figure 1(b-c), evaluating \phi_{i,j}(a_i) involves computing the G + P B-splines at a_i, multiplying each by its corresponding learnable coefficient, and summing the results. Then, as indicated in Eq. 5, all the spline functions are aggregated to produce the output of the layer:

a_j^{(l+1)} = \sum_{i=0}^{N_{in}^{(l)}} \sum_{k=0}^{N_b - 1} b_{k,P}(a_i^{(l)}) \, w_{i,k,j}^{(l)}   (5)

This is equivalent to a matrix multiplication of the form

A^{(l+1)} = B^{(l)} W^{(l)} = b(A^{(l)}) W^{(l)}   (6)

Convolutional KAN: The KAN operation can also be applied to convolution by replacing the scalar filter weights with learnable activations. In practice, similarly to how standard convolution can be implemented as im2col followed by matrix multiplication [4], a convolutional KAN unfolds the input feature map into a matrix of patches, then applies the same KAN operation. An open-source implementation of this ConvKAN operation is available in [26]. Efficient training and design of CNN networks based on this convolution operation have also been studied in [8]. In this paper, to distinguish between the MLP- and CNN-based KANs, we refer to the convolutional one as ConvKAN.

II-B Increased number of operations in KANs

Figure 3: The compute graph of a KAN layer l consists of B-spline evaluation followed by a standard matrix multiplication.

Figure 3 sketches a KAN layer l from a computational perspective. Firstly, B-splines are evaluated as a function of an activation matrix A^l, which generates an intermediate matrix B^l. The latter is then multiplied by the matrix of learned coefficients W^l. The dimensions of the matrices A^l, B^l and W^l are respectively M × N_in^l, M × N_in^l(G+P) and N_in^l(G+P) × N_out^l, where M is the batch size and N_in^l and N_out^l the number of input and output neurons of the layer l.
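The layer computation of Eq. 6 can be sketched in NumPy: evaluate the B-spline basis for every activation, flatten the results into B^(l), and multiply by W^(l). This is a minimal illustrative sketch, not the paper's implementation; the grid, layer sizes, and random data below are arbitrary assumptions.

```python
import numpy as np

def bspline_design_matrix(A, t, P):
    """Evaluate the G + P B-splines for every entry of the activation
    matrix A (shape M x N_in) via the Cox-de Boor recursion (Eqs. 2-3),
    and lay them out as the matrix B of Eq. 6, shape M x (N_in * (G + P))."""
    x = A[..., None]  # broadcast each input against all knot intervals
    # Degree 0 (Eq. 3): indicator of each of the G + 2P knot intervals.
    b = ((t[:-1] <= x) & (x < t[1:])).astype(float)
    for d in range(1, P + 1):  # raise the degree one step at a time (Eq. 2)
        left = (x - t[:-d - 1]) / (t[d:-1] - t[:-d - 1]) * b[..., :-1]
        right = (t[d + 1:] - x) / (t[d + 1:] - t[1:-d]) * b[..., 1:]
        b = left + right
    return b.reshape(A.shape[0], -1)

# A KAN layer [N_in=2, N_out=3] with G=5, P=3 then reduces to B @ W (Eq. 6).
rng = np.random.default_rng(0)
G, P, M, N_in, N_out = 5, 3, 4, 2, 3
t = np.linspace(-P / G, 1 + P / G, G + 2 * P + 1)  # uniform grid, extended by P
A = rng.uniform(0.0, 1.0, size=(M, N_in))          # activations in the domain [0, 1)
W = rng.normal(size=(N_in * (G + P), N_out))       # learnable B-spline coefficients
B = bspline_design_matrix(A, t, P)                 # M x (N_in * (G + P))
A_next = B @ W                                     # next-layer activations, M x N_out
```

Inside the input domain the G + P basis values of each input sum to one (partition of unity), which is a convenient sanity check on the recursion.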
II-B1 Increased model parameters and number of operations

Although KANs are more parameter efficient than their non-KAN counterparts [20], parameterization using a function basis leads to an increase in model parameters by a factor of G + P, compared to non-KAN MLPs or CNNs, for the same layer dimension [N_in^l, N_out^l]. For instance, a KAN layer with cubic B-splines (P = 3 and G = 5) would have 8× as many parameters as an MLP layer of the same dimensions. This, in turn, increases the computational effort required. Indeed, the number of multiply-accumulate operations to perform the B^(l) × W^(l) matrix multiplication of Eq. 6 is N_in^(l) · (G + P) · M · N_out^(l).

Figure 4: Cox-de Boor evaluation triangle for B-splines b_{3,3} and b_{4,3} of degree P = 3. Each row corresponds to the recursive computation of b_{i,P}(x) from b_{i,P-1}(x) and b_{i+1,P-1}(x). The number of required B-splines in every row decreases linearly with P, leading to the total number of operations in Table I.

For the B-spline evaluation (B^(l) = b(A^(l))), most KAN implementations rely on an iterative and parallel version of the Cox-de Boor formula of Eq. 2 to compute N_b = G + P B-splines of degree P. Figure 4 conceptually illustrates it for only two B-splines, b_{3,3} and b_{4,3}, of Figure 2. As shown in the first row of the figure, the algorithm starts by evaluating the five B-splines of degree 0 (cf. Eq. 3) contributing to b_{3,3} and b_{4,3}, then iteratively at each row d it evaluates the B-splines of degree d according to Eq. 2 until the row d = P. In the general form, it starts with G + 2P B-splines of degree d = 0. Then, iteratively at each row d, it evaluates the G + 2P − d B-splines of degree d until the row d = P.
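The row-by-row evaluation triangle of Figure 4 can be traced with a small scalar routine. A sketch under the uniform grid of Figure 2 (G = 3, P = 3); the function name and grid construction are ours, not from the paper:

```python
import numpy as np

def cox_de_boor_rows(a, t, P):
    """Row-by-row Cox-de Boor evaluation as in Figure 4: row d holds the
    G + 2P - d B-splines of degree d, and the last row (d = P) is the basis."""
    # Degree-0 row (Eq. 3): indicator of each knot interval.
    rows = [np.array([1.0 if t[i] <= a < t[i + 1] else 0.0
                      for i in range(len(t) - 1)])]
    for d in range(1, P + 1):  # one row per degree (Eq. 2)
        prev = rows[-1]
        row = np.zeros(len(prev) - 1)
        for i in range(len(row)):
            row[i] = ((a - t[i]) / (t[i + d] - t[i]) * prev[i]
                      + (t[i + d + 1] - a) / (t[i + d + 1] - t[i + 1]) * prev[i + 1])
        rows.append(row)
    return rows

# Uniform grid of Figure 2: G = 3, P = 3, G + 2P + 1 = 10 knots, domain [0, 1].
G, P = 3, 3
t = np.linspace(-1.0, 2.0, G + 2 * P + 1)  # knot spacing 1/G = 1/3
rows = cox_de_boor_rows(0.4, t, P)
basis = rows[-1]  # the G + P = 6 cubic B-spline values at the input
```

The row lengths shrink from G + 2P = 9 down to G + P = 6, matching the triangle in the figure, and the final row sums to one inside the input domain.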
Under the hypothesis that the reciprocals of the grid differences (t_{i+P} − t_i, t_{i+P+1} − t_{i+1}) are precomputed, four multiplications are needed per B-spline evaluation. This is performed for the whole layer input matrix A^l of size M × N_in^(l), leading to 4 M N_in^(l) (P(G + 2P) − P(P−1)/2) multiplications. We report in Table I the number of arithmetic multiplications (MULs) for an MLP layer versus a KAN layer.

When it comes to convolutional KANs, most implementations rely on lowering the convolution operation to a matrix multiplication using the im2col technique [3]. Therefore, to obtain the arithmetic complexity of ConvKAN, we can replace N_out and N_in in Table I respectively by C_out and K^2 C_in H_out W_out, with K the filter size, H_out, W_out the output image size, and C_in, C_out the input and output number of channels.

From these considerations, we can estimate the computational complexity of KANs and explore the impact of quantization. To do so, we estimate the number of operations using the conventional BitOps metric [31, 19]. With such a metric, the cost of multiplying n-bit and m-bit integers can be approximated to n × m binary operations.

TABLE I: Number of multiplications (MULs) of [N_in, N_out] MLP and KAN layers with batch size M.

                               MLP             KAN
Matrix Multiply                M N_out N_in    M N_out N_in (G + P)
Non-Linearity (B-splines)      -               4 M N_in (P(G + 2P) − P(P−1)/2)

We define the bit-widths of the activations A^(l), the intermediate tensor of B-spline evaluations B^(l), and the weights W^(l) as bw_A^(l), bw_B^(l), and bw_W^(l), respectively.
Then, based on the number of multiplications reported in Table I, the multiplication BitOps for a KAN layer l can be approximated as

\mathrm{BitOps}_{KAN}^{(l)} = M N_{out}^{l} N_{in}^{l} \, bw_B^{(l)} bw_W^{(l)} + 4 M N_{in}^{l} \left[ P(G+2P) - \frac{P(P-1)}{2} \right] \left( bw_A^{(l)} \right)^2   (7)

The last term in Eq. 7, (bw_A^{(l)})^2, results from the Cox-de Boor recursion, where each multiplication has both operands involving activations (a), hence having the same bit-width bw_A^(l) (see Eq. 3), unlike the matrix multiply involving B-splines and weights, hence having bit-widths bw_B^(l) and bw_W^(l), respectively. In comparison, the BitOps resulting from multiplications in the MLP layer does not include the additional term due to B-spline evaluation:

\mathrm{BitOps}_{MLP}^{(l)} = M N_{out}^{l} N_{in}^{l} \, bw_A^{(l)} bw_W^{(l)}   (8)

Eq. 7 shows that the bit-width of the activations, bw_A^(l), only impacts the arithmetic cost of the B-spline evaluation, while the bit-widths of the weights W^(l) and intermediate tensor B^(l) (i.e., bw_W^(l) and bw_B^(l), respectively) contribute to the arithmetic cost of the matrix multiplication. In this paper, we study how quantizing weights, B-splines, and activations reduces BitOps and mitigates the intrinsic increase in model parameters and operations in KANs. In Section III, we further elaborate on how quantization reduces KAN computational complexity, and, building on this, how the tabulation of B-splines and learned splines further reduces it.

II-B2 Computational overhead of B-spline evaluation

In addition to the increased number of operations, the evaluation of the G + P B-splines with the recursive Cox-de Boor formula is not accelerated efficiently on GPUs due to the interdependencies introduced by the recursion, which intrinsically serializes the computation. This directly impacts the inference latency.
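As a numeric illustration of the BitOps model in Eqs. 7-8 above, the two cost terms can be scripted directly; the layer sizes and bit-width combinations below are arbitrary examples, not values from the paper:

```python
def bitops_kan(M, N_in, N_out, G, P, bw_A, bw_B, bw_W):
    """Eq. 7: matmul term at bw_B * bw_W per MUL, plus the Cox-de Boor term
    at bw_A^2 since both of its operands derive from activations."""
    matmul = M * N_out * N_in * (G + P) * bw_B * bw_W
    bspline = 4 * M * N_in * (P * (G + 2 * P) - P * (P - 1) // 2) * bw_A ** 2
    return matmul + bspline

def bitops_mlp(M, N_in, N_out, bw_A, bw_W):
    """Eq. 8 for the MLP layer of the same size (no B-spline term)."""
    return M * N_out * N_in * bw_A * bw_W

# Example [64, 64] layer with G=5, P=3: quantizing B-splines and weights
# shrinks the dominant matmul term, while bw_A only affects the B-spline term.
fp16 = bitops_kan(1, 64, 64, G=5, P=3, bw_A=16, bw_B=16, bw_W=16)
quant = bitops_kan(1, 64, 64, G=5, P=3, bw_A=5, bw_B=3, bw_W=8)
```

With these example numbers the 16-bit layer costs 10,354,688 BitOps against 978,432 for the 5/3/8-bit configuration, a roughly 10× reduction.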
In particular, as further shown in Section IV, the B-spline evaluation completely dominates the latency of the forward pass. For instance, as reported in Table I, the B-spline computation accounts for up to 98% of the total inference time for MLP-based KANs, and 78-84% for convolutional models, where the im2col overhead reduces the B-spline contribution. Therefore, for KANs, it is important to accelerate the evaluation of the basis functions [13, 29, 9]. Similarly to what is shown in [13] and [9] for CiM-based and systolic-array-based accelerators, respectively, we show in Section III-B that this issue can be addressed on GPUs by replacing the recursive computation of B-splines with precomputed lookup tables (LUTs). This significantly reduces inference latency with negligible memory overhead and accuracy drop, as shown in the results reported in Section IV. Moreover, we show that dedicated hardware [9] enables further improvements in KAN efficiency thanks to low-bit quantization. Finally, in Section III-C we study the opportunities and challenges of a multiplier-free approach that uses low-bit-quantized precomputed tables to replace the entire spline computation [12].

II-C Uniform Integer Quantization

Quantization has been extensively studied as an approach to reduce the memory footprint and inference latency of DNNs. In this work, we focus on uniform integer quantization, which consists of mapping a floating-point range [α, β] to an integer range [α_q, β_q] that is representable by the considered precision (bit-width). The linear transformation to map a real number x to an integer x_q is parameterized with a scale s ∈ ℝ and a zero-point z ∈ ℤ as follows:

x = \mathrm{dequantize}(x_q, s, z) = s (x_q - z)   (9)

x_q = \mathrm{quantize}(x, s, z, \alpha_q, \beta_q) = \mathrm{clip}\left( \mathrm{round}\left( \frac{1}{s} x + z \right), \alpha_q, \beta_q \right)   (10)

where clip(x; a, b) := min(max(x, a), b).
The parameters s, z are computed by mapping the floating-point bounds [α, β] to the quantization range bounds [α_q, β_q] as follows:

s = \frac{\beta - \alpha}{\beta_q - \alpha_q}   (11)

z = \mathrm{round}\left( \frac{\beta \alpha_q - \alpha \beta_q}{\beta - \alpha} \right)   (12)

II-D Activation Function Optimization

In digital neural networks, non-linear activations can be categorized into two main approaches from a computational perspective: (i) direct algorithmic implementation of the functions and (ii) use of precomputed Look-Up Tables (LUTs) to replace the computation. The first approach is generally implemented either by composing elementary functions (e.g., exponential, reciprocal, etc.) or by approximating them mathematically, for instance, with a polynomial. This approach has been studied in various works, especially for architectures such as recurrent neural networks (RNNs) that require multiple evaluations of non-linear activation functions [23]. However, these works primarily focus on specific functions common in neural networks, such as the tanh activation function [1, 11]. In the second approach, precomputed activation function samples are stored in the LUT. The LUT is then accessed using the result of the previous layer [24, 30], and the corresponding precomputed output value is used.

As KANs rely even more heavily on activation functions than RNNs, these techniques become much more critical. Indeed, a feed-forward KAN layer consists of N_in · N_out spline functions, while ConvKAN has K^2 · C_in · C_out functions. In turn, as already mentioned, each spline is composed of G + P B-splines. In the context of the KAN activation functions, combining low-bit integer quantization and precomputed look-up tables has a twofold effect: (i) reducing the latency of computing the non-linear activation functions by reducing the amount of data to process and the computation complexity, and (ii) reducing the memory cost when using precomputed look-up tables for activation functions.
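The uniform-quantization mapping of Eqs. 9-12 translates directly into code. A sketch; the 3-bit signed range and sample values below are arbitrary examples:

```python
import numpy as np

def quant_params(alpha, beta, alpha_q, beta_q):
    """Scale and zero-point mapping [alpha, beta] -> [alpha_q, beta_q] (Eqs. 11-12)."""
    s = (beta - alpha) / (beta_q - alpha_q)
    z = round((beta * alpha_q - alpha * beta_q) / (beta - alpha))
    return s, z

def quantize(x, s, z, alpha_q, beta_q):
    """Eq. 10: affine map, round, then clip to the representable integer range."""
    return np.clip(np.round(x / s + z), alpha_q, beta_q).astype(np.int32)

def dequantize(x_q, s, z):
    """Eq. 9: recover the real value up to rounding error (at most s/2)."""
    return s * (x_q.astype(np.float64) - z)

# Example: 3-bit signed quantization (2^3 = 8 levels) of values in [0, 1].
alpha, beta = 0.0, 1.0
alpha_q, beta_q = -4, 3
s, z = quant_params(alpha, beta, alpha_q, beta_q)
x = np.array([0.0, 0.25, 0.5, 0.99])
x_q = quantize(x, s, z, alpha_q, beta_q)
x_hat = dequantize(x_q, s, z)
```

Here s = 1/7 and z = -4, so the quantization error of any in-range value is bounded by s/2.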
The remainder of the paper presents the proposed approach for combining these two methods, aiming to achieve latency and memory improvements without compromising accuracy.

III KANtize Approach

In this section, we explore the quantization and tabulation opportunities in KANs and combine them. Section III-A elaborates how quantization helps address the issue of increased model parameters and BitOps complexity. In particular, we discuss how quantizing the different KAN tensor components (A^(l), B^(l), and W^(l)) affects computational complexity. Section III-B illustrates how to efficiently implement the quantized tabulation scheme for the B-splines b(·), which helps address the inference bottleneck of the recursive B-spline evaluation. Finally, Section III-C illustrates the tabulation of the learned splines ϕ(·) (see Eq. 4), which completely replaces the B-spline evaluation and their linear combination depicted in Figure 1(c), and allows computing a KAN layer by visiting the tables and summing their values.

III-A Exploring the Quantization of KANs' Weights, Activations and B-splines

As shown in Figure 3, from a computational perspective, weight quantization in a KAN layer is similar to MLP layers, with the coefficients of the basis functions acting as the weights (i.e., W^(l)). Hence, when we apply conventional uniform integer weight quantization techniques to the KAN's coefficients, i.e., we reduce bw_W^(l) in Eq. 7, this significantly reduces the memory cost of weights and the BitOps complexity of the matrix multiplication. In MLPs, quantizing both weights W^(l) and activations A^(l), i.e., reducing bw_W^(l) and bw_A^(l) in Eq. 8, not only further reduces the memory cost and BitOps of the matrix multiplication but also enables more efficient hardware utilization, allowing the use of integer arithmetic pipelines or integer tensor cores.
In KANs, since the B-spline outputs are used in the matrix multiplication, to achieve similar benefits we quantize the intermediate B-spline evaluation tensor B^(l), i.e., we reduce bw_B^(l) in Eq. 7. Finally, quantization of activations A^(l) in KANs is helpful for reducing the memory and BitOps cost of computing the B-splines, i.e., reducing bw_A^(l) in Eq. 7. As shown in Table I, if N_out is not too large, then the BitOps cost of B-spline evaluation can be as significant as the matrix multiplication. Therefore, to properly address the computational issues of KANs, it is important to consider quantizing all three of W^(l), A^(l), and B^(l) and investigating their combinations. The results of this exploration are reported in Section IV-A.

III-B B-spline Tabulation

Figure 5: Only half the values of one B-spline need to be stored (e.g., half of b_{3,3}, colored in black); the other B-splines can be evaluated by translating the input and using symmetry.

Figure 6: Tabulation of the canonical B-spline B_{0,3}(x): only the first half is stored, with 2^k entries per knot interval [t_i, t_{i+1}].

Figure 7: Naive tabulation example with G = 15, k = 4: uniform quantization to 2^k levels wastes entries in zero regions, capturing only 5 of 16 entries for the B-spline.

As discussed in Section II-B2, the inefficiency of the recursive B-spline evaluation can be addressed by replacing the recursive computations with precomputed LUTs, significantly reducing inference latency. As pointed out in previous literature [13, 9, 12] and depicted in the example in Figure 5 (adapted from Figure 2), in the case of a uniform grid, we only need to tabulate half of a single canonical B-spline (highlighted in black in the figure). Then, we can infer the remaining elements from simple symmetry and translation.
For example, for a given input x, only the value b_{3,3}(x) would be saved in the LUT, while the value b_{1,3}(x) could be obtained by retrieving b_{3,3}(t_5 − (x − t_3)), which is identical, as shown in Figure 5. This enables a single compact LUT to serve all KAN layers, regardless of model depth, grid size, or input range. When the grid is adapted to each layer's inputs during training [20], the non-uniform knot spacing breaks translation invariance and would require separate tables. However, techniques such as least-squares fitting make it possible to approximate a general spline in a uniform B-spline basis, achieving arbitrary precision with a sufficiently fine grid. Without loss of generality, this work therefore focuses on uniform grids. A separate LUT would be needed only for each distinct spline degree P. To build the LUT, the canonical B-spline is evaluated at equidistant points within each knot interval of the stored half-support, with boundary points mapped exactly to zero to avoid accumulated errors outside the local support. As illustrated in Figure 6, when the tensor A^(l) is quantized to k = bw_A^(l) bits, the LUT contains 2^k entries per knot interval [t_i, t_{i+1}]. For example, k=2 yields a coarse staircase of 4 entries in each interval, e.g., [1,2], while k=4 gives 16 entries and a much tighter approximation. This contrasts with a naive strategy that uniformly quantizes the full input domain into 2^k levels: as shown in Figure 7, most entries would fall outside the B-spline's narrow local support, wasting storage and yielding a poor approximation, especially at higher G. The values stored in the LUT can also be quantized, to h = bw_B^(l) bits, corresponding to the quantization of tensor B^(l). We refer to the resulting configuration as a k×h LUT; this notation denotes the quantization bit-widths, not the table dimensions.
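The half-support tabulation and symmetry-based lookup described above can be sketched as follows, assuming a uniform knot vector and degree P = 3; the max-scaled rounding used here for the h-bit stored values is an illustrative choice, not necessarily the paper's exact quantizer:

```python
def cox_de_boor(i, p, x, t):
    """Recursive Cox-de Boor evaluation of the B-spline basis function B_{i,p}(x)."""
    if p == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    a = (x - t[i]) / (t[i + p] - t[i]) * cox_de_boor(i, p - 1, x, t)
    b = (t[i + p + 1] - x) / (t[i + p + 1] - t[i + 1]) * cox_de_boor(i + 1, p - 1, x, t)
    return a + b

P, k, h = 3, 4, 3                   # degree, addressing bits, stored-value bits
t = list(range(P + 2))              # uniform knots [0..P+1]; support is [0, P+1]
half = (P + 1) // 2                 # knot intervals in the stored half-support
# 2^k equidistant samples per knot interval, first half of the support only
xs = [i + j / 2 ** k for i in range(half) for j in range(2 ** k)]
lut = [cox_de_boor(0, P, x, t) for x in xs]
step = max(lut) / (2 ** h - 1)      # quantize stored values to h bits -> a k-by-h LUT
lut = [round(v / step) * step for v in lut]

def b_spline(x):
    """Canonical B-spline via the half-support LUT, using symmetry about (P+1)/2."""
    c = (P + 1) / 2
    if x > c:
        x = 2 * c - x               # reflect the right half onto the stored left half
    if x < 0:
        return 0.0                  # outside the local support
    return lut[min(int(x * 2 ** k), len(lut) - 1)]
```

Shifted copies of the basis (e.g., b_{1,3}) would be served by the same table after translating the input, as described in the text.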
Since the half-support spans ⌈(P+1)/2⌉ knot intervals with 2^k entries of h bits each, the actual LUT memory is 2^k × ⌈(P+1)/2⌉ × h bits. As highlighted in previous research [9, 12], tabulation further reduces the computational complexity, as it replaces the cost of recursively computing the B-spline values with the cost of fetching data from a LUT. It also ensures that the LUT only stores the non-zero B-spline values. However, without specialized hardware support for efficient sparsity management (e.g., as in [9]), reconstructing the complete sparse matrix is necessary to perform the matrix multiplication with the coefficients. We study, for the first time, the effects of tabulating the B-splines with different low-bit values of k and h, in terms of the trade-off between accuracy and the memory needed to store the tables. Furthermore, we evaluate the gains in BitOps reduction and how they translate into inference latency reduction. The latency gains, of course, depend on the hardware support: GPUs natively support only specific bit-widths (e.g., 8 bits) and would not gain in latency from quantization below that (e.g., less than 8 bits), whereas custom hardware allows further latency improvements at lower bit-widths. The results of the study are presented in Section IV-B.

I-C Spline Tabulation and Quantization

Finally, we explore the impact of low-bit quantization on multiplier-free KAN inference. Instead of obtaining the learned splines through a linear decomposition in the B-spline basis followed by a matrix multiplication (as shown in Figure 3), we directly tabulate each learned function independently on the extended grid domain, as also done in [12]. Thus, given a layer of dimensions [N_in, N_out], the only operations remaining after tabulation of the splines are the N_in·N_out additions that sum the values retrieved from the corresponding N_in·N_out spline tables.
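The spline-tabulated layer described above (one table lookup per connection followed by N_in·N_out additions) can be sketched as follows; the layer sizes, grid range, and random stand-in tables are illustrative assumptions:

```python
import random

random.seed(0)
N_in, N_out, k = 4, 3, 5            # toy layer dimensions; k = input bit-width
grid_lo, grid_hi = -1.0, 1.0        # assumed extended grid range

# One 2^k-entry table per connection; random values stand in for learned splines
tables = [[[random.uniform(-1, 1) for _ in range(2 ** k)]
           for _ in range(N_out)] for _ in range(N_in)]

def spline_tab_forward(x):
    """Multiplier-free KAN layer: map each input to a table index using the grid
    bounds as the quantization range, gather one value per connection, and sum
    over the N_in inputs to produce each of the N_out outputs."""
    scale = (grid_hi - grid_lo) / (2 ** k - 1)
    out = [0.0] * N_out
    for i, xi in enumerate(x):
        idx = min(max(round((xi - grid_lo) / scale), 0), 2 ** k - 1)
        for j in range(N_out):
            out[j] += tables[i][j][idx]   # N_in*N_out lookups and additions total
    return out

y = spline_tab_forward([-0.5, 0.0, 0.3, 0.9])
```

Using the grid bounds as the quantization range anticipates the calibration-free parameter choice discussed in the text: splines vanish outside the grid, so clamping there loses nothing when the edge entries are zero.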
Figure 8: Example of a learned spline in the single-layer KAN [784,10] for different G values. The legend shows the number of bits used for input quantization.

Figure 8 shows examples of different splines tabulated by sampling the inputs with different bit-widths and for different G values. For example, when inputs are quantized with bw_A^(l) = 3 bits, the resulting table has 2^3 = 8 entries, each storing the spline output value at the corresponding quantized input level (red curve). Similar to B-spline tabulation, the values stored in the spline tables can also be quantized, to h bits, to further reduce the memory cost. For comparison, the original FP32 coefficients require (G+P)·32 bits per connection. The memory cost is significant in this case because the table dimension scales with the layer dimensions: a single KAN layer leads to N_in · N_out · 2^(bw_A^(l)) · h stored bits when using h-bit quantized spline values and 2^(bw_A^(l)) table entries. For configurations where 2^(bw_A^(l)) · h < (G+P)·32, the spline table is actually smaller than the FP32 W^(l) coefficients it replaces. The quantization parameters can be computed without requiring any calibration, a direct benefit of the local support property of B-splines. Since all individual B-spline functions are zero outside their respective grid intervals (as illustrated in Figure 2), any linear combination of these splines is also zero outside the overall grid range. This characteristic enables us to use the defined grid range for computing the quantization parameters, eliminating the need for calibration data. For example, as shown in Figure 8, all splines of a given layer share the same grid and are zero outside the extended grid range.
Therefore, the quantization range for the input can be set directly to the grid bounds, since any activation falling outside this range would have zero contribution regardless of its actual value. Furthermore, the same grid is often used across the KAN layers; thus, the same quantization parameters are used to map the previous layer's output to an integer for addressing the spline tables.

Scalability of spline tabulation: While spline tabulation eliminates multiplications, the total number of spline tables scales with ∑_l N_in^(l) · N_out^(l), which grows quickly with model size. The implications of this scaling for GPU inference latency and FPGA resource cost are studied in Sections IV-C2 and IV-C3, respectively.

IV Experimental Results

In line with previous work on KANs [20, 10, 12, 9], our evaluation focuses on the two state-of-the-art KAN families, MLP-based KANs and ConvKANs, that we can feasibly train for classification on MNIST and CIFAR-10, given the lack of open-source pretrained KAN models. For MNIST classification, we use MLP-based KANs and a ConvKAN model based on LeNet [16] with two ConvKAN layers followed by a single feed-forward KAN layer. For CIFAR-10, we study two simple convolutional models, CNN3 and CNN4, with three and four ConvKAN layers, respectively, and a ResNet18-based model in which convolutions are replaced by ConvKAN layers, thus called ResKAN18. Table I summarizes the models, datasets, and number of parameters.

TABLE I: Models used in the evaluation (P=3, G=3). Dimensions denote layer widths for KAN models and channel counts for ConvKAN models (3×3 kernels unless noted).
Model      Type      Dataset    Dimensions           Params
KANMLP1    KAN       MNIST      [784,10]             47 K
KANMLP2    KAN       MNIST      [784,64,10]          305 K
LeKAN      ConvKAN   MNIST      [1,6,16] (5×5)       39 K
CNN3       ConvKAN   CIFAR-10   [3,32,64,128]        560 K
CNN4       ConvKAN   CIFAR-10   [3,32,64,128,512]    4.1 M
ResKAN18   ConvKAN   CIFAR-10   ResNet18             67 M

For all models, all KAN layers in a given model share the same G and P parameters and a uniform grid that is not updated during training. Unless otherwise specified, all models were trained without the SiLU-based bias suggested by [20]. This bias is, in fact, an independent MLP branch, and thus significantly impacts the training and the resulting weight distribution between the actual basis-function coefficients and the scaling used for SiLU. In this paper, we focus on studying the behavior of the spline-based KAN operation itself. In this section, we report the results of the described quantization and tabulation approaches on these models. Section IV-A reports the results of quantization in terms of the trade-off between accuracy and BitOps reduction. Section IV-B studies the trade-off between accuracy and memory reduction enabled by B-spline tabulation, and discusses the resulting improvements in inference latency. Finally, Section IV-C examines these same aspects for spline tabulation and the associated scalability challenges.

IV-A Quantization Space Exploration: Accuracy vs BitOps

Figure 9: Accuracy vs BitOps trade-off results of KAN quantization. Single-component graphs (a,b,c,g,h,i) report results when quantizing only one component of the KAN layer at a time, i.e., W^(l), A^(l), or B^(l). Different components use different line styles, and different bit-widths use different shapes.
All-component graphs (d,e,f,j,k,l) show the Pareto front of results obtained by quantizing multiple components of the KAN layer at the same time; the bit-widths of W^(l), A^(l), and B^(l) are encoded by marker colors, sizes, and shapes, respectively.

As mentioned in Sections I-B and I-A, quantizing the different tensor components of the KAN layer, W^(l), A^(l), and B^(l), impacts the computational cost (measured in BitOps using Eq. 7) and the model's accuracy differently. To isolate the effect of each, we first quantize only one component at a time while holding the other components at their baseline precision of 32 bits. Figures 9 (a,b,c,g,h,i) report accuracy vs BitOps results for the six models described in Table I when quantizing only one component of the KAN layer at a time. As shown in these figures, the weight matrix W^(l) is the most sensitive to quantization, consistent with previous findings [10]. The activation matrix A^(l) is less sensitive overall than W^(l) (with the exception of LeKAN), but contributes less to the BitOps reduction, as anticipated in Section I. Finally, we characterize, for the first time, the quantization sensitivity of B^(l), the intermediate tensor of B-spline activations, which proves quite robust. For instance, the CNN3 model, trained on CIFAR-10, shows a sharp accuracy drop of ≈12% when W^(l) is quantized to 5 bits, whereas accuracy drops by only 2% for A^(l) at the same bit-width and shows practically no degradation when B^(l) is quantized to 5 or even 4 bits. More generally, all models maintain nearly full accuracy even when B^(l) is quantized to 3 or 4 bits, with CNN3 being the most sensitive (≈3.5% drop at 3 bits). Moreover, quantizing B^(l) also greatly contributes to the BitOps reduction; the CNN and ResKAN models exhibit the largest relative savings.
For instance, quantizing B^(l) to 3 bits in the ResKAN18 model reduces BitOps from ≈4×10^9 to ≈6×10^8, i.e., almost one order of magnitude. The CNN4 model shows a similarly strong drop, from 4.83×10^9 to 6.5×10^8 BitOps. We then quantize all the tensor components of the KAN layer, W^(l), A^(l), and B^(l), at the same time, exploring all combinatorial bit-width reductions. In Figure 9 (d,e,f,j,k,l), we report the obtained Pareto front in terms of the accuracy vs BitOps trade-off, i.e., the best quantization combinations of W^(l), A^(l), and B^(l). For easier comparison, we align the x-axis values across graphs showing the same model (e.g., a and d). As expected, across all tested models, the combined quantization yields a consistent reduction in BitOps compared to quantizing the KAN components one at a time, without loss of accuracy. For example, for ResKAN18, while quantizing the single B^(l) component reduced BitOps by almost an order of magnitude (from ≈4×10^9 to ≈6×10^8) without accuracy loss (Figure 9(i)), the combined quantization allowed a further reduction of more than 60% (to ≈2×10^8) without accuracy loss (Figure 9(l)). In general, we observe that solutions providing both high accuracy and large BitOps reductions typically use W^(l) quantization with 5-8 bits (red, purple, and pink colors in the graphs), while B^(l) quantization values are most of the time as low as 3 bits (pentagon shape in the graphs), and A^(l) quantization values typically range from 5 to 8 bits, with some models going as low as 4 bits (e.g., KANMLP1, KANMLP2, and LeKAN). W^(l) quantization below 5 bits (green and orange colors) mostly entails substantial accuracy drops, as does A^(l) quantization below 4 or 5 bits.
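The combinatorial sweep above reduces to a standard Pareto-front extraction over (BitOps, accuracy) pairs; a minimal sketch, with hypothetical sweep numbers:

```python
def pareto_front(points):
    """points: (bitops, accuracy) pairs, one per bit-width combination of
    (bw_W, bw_A, bw_B). Keep only non-dominated configurations: no other point
    has both lower-or-equal BitOps and strictly higher accuracy."""
    front, best_acc = [], float("-inf")
    for bitops, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:          # cheapest configuration that improves accuracy
            front.append((bitops, acc))
            best_acc = acc
    return front

# Hypothetical sweep results (BitOps, accuracy)
sweep = [(4e9, 0.93), (6e8, 0.93), (2e8, 0.93), (2e8, 0.80), (1e8, 0.61)]
front = pareto_front(sweep)
```

Sorting by ascending BitOps (breaking ties by descending accuracy) makes the scan linear after the sort.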
IV-B B-Spline Tabulation Space Exploration

In this subsection, we explore the space of possible B-spline tabulation opportunities and show the effects in terms of the accuracy vs memory trade-off (Sec. IV-B1), accuracy vs computational complexity (Sec. IV-B2), inference latency on GPUs and on a systolic-array-based hardware accelerator, KAN-SAs [9] (Sec. IV-B3), and area and frequency scaling of the latter (Sec. IV-B4).

IV-B1 Accuracy vs Memory Trade-off

In Figure 10, we report the Pareto front of accuracy versus LUT memory for each model with fixed 8-bit weights W^(l), varying the addressing bit-width bw_A^(l) and the stored-value bit-width bw_B^(l). Some plots do not report solutions with higher bit-widths (e.g., KANMLP1) since they are Pareto-dominated by the reported low-bit-width ones. For all models, bw_B^(l) can be reduced to 3 or 4 bits with minimal accuracy loss, with associated LUT memory gains of more than one order of magnitude when moving from 8 bits (X shape) to 3 (pentagon) or 2 (dot) bits, confirming the robustness of the B-spline activations already observed in Figure 9. The addressing bit-width bw_A^(l) has the largest impact on LUT memory due to the exponential scaling (2^(bw_A^(l)) entries), but also on accuracy: reducing bw_A^(l) below 5 bits causes significant degradation for most models. The MNIST models are the most tolerant, maintaining near-baseline accuracy even at bw_A^(l) = 4 with bw_B^(l) = 3, while the convolutional models degrade faster at low bw_A^(l) values.

Figure 10: Accuracy vs LUT memory trade-off for B-spline tabulation with 8-bit weights W^(l). The graphs show the Pareto front of accuracy versus LUT memory cost for different bit-widths of A^(l) (marker size) and stored values B^(l) (marker shape).
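The LUT memory figures in this trade-off follow directly from the k×h layout; a quick sketch of the cost model, assuming cubic splines (P = 3):

```python
import math

def bspline_lut_bits(k, h, P=3):
    """Memory of a single shared k-by-h B-spline LUT: 2^k entries of h bits per
    knot interval, over the ceil((P+1)/2) intervals of the stored half-support."""
    return 2 ** k * math.ceil((P + 1) / 2) * h

# Moving from an 8x8 to a 5x3 configuration shrinks the shared LUT by ~21x
full = bspline_lut_bits(8, 8)       # 4096 bits
small = bspline_lut_bits(5, 3)      # 192 bits
```

Because a single canonical LUT is shared across layers, this cost is independent of model size; only the addressing and value bit-widths matter.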
IV-B2 Accuracy vs Computational Complexity

As mentioned, B-spline tabulation further reduces computational complexity and runtime. Concerning computational complexity, Figure 11 reports the Pareto front of accuracy versus BitOps when combining B-spline tabulation with quantization. Compared to Figure 9, the elimination of the Cox-de Boor evaluation shifts the Pareto front to significantly lower BitOps, as the only remaining computational cost is the weight matrix multiplication. For instance, for ResKAN18, it is possible to reduce BitOps by one order of magnitude (from ≈7×10^8 to ≈7×10^7) without accuracy loss (Figure 11(l)). Considering both quantization and tabulation, the total BitOps reduction from the FP32 baseline is substantial: for ResKAN18, from ≈4×10^9 in Figure 9(i) to ≈7×10^7 in Figure 11(l), i.e., more than 50× without accuracy loss.

Figure 11: Accuracy vs BitOps trade-off for B-spline tabulation combined with quantization. By replacing the Cox-de Boor evaluation with a LUT, the B-spline computational cost is eliminated, leaving only the matrix multiplication. The bit-width of W^(l) is encoded by marker color, that of A^(l) (LUT addressing) by marker size, and that of B^(l) (LUT stored values) by marker shape.

IV-B3 Latency Improvement

Finally, in Table I, we report the inference speedup on GPU obtained thanks to B-spline tabulation, compared with direct evaluation using the recursive Cox-de Boor formula (Eq. 2). As shown in the table, B-spline tabulation consistently accelerates inference across grid sizes G and model architectures, from single- and two-layer KAN MLPs for MNIST classification, to ConvKAN models based on LeNet and deeper CNNs, and even the ResNet-based ConvKAN. Tabulation yields 2.11–2.87× total speedup for G=3 and up to 3.36× for G=5.
However, GPUs natively support only specific bit-widths (e.g., 8-bit integer) and do not benefit from reducing the B-spline table precision below 8 bits. Custom hardware accelerators such as KAN-SAs [9] can leverage lower-bit B-spline tables: reducing the precision of the stored values shrinks each processing element (PE), allowing a larger systolic array to fit within the same area budget.

TABLE I: GPU time (ms) on an RTX3090 for the full test set (10 000 samples) with 8-bit B-spline tabulation (256 entries), measured with torch.profiler. P=3 for all models.

Model      G   Baseline (ms)   BSP %   BSP Tab. (ms)   Speedup
MNIST
KANMLP1    3   13.9            96      4.8             2.87×
           5   17.6            97      5.2             3.36×
KANMLP2    3   15.2            97      5.4             2.83×
           5   19.3            98      5.9             3.29×
LeKAN      3   763.3           81      359.8           2.12×
           5   923.9           84      373.3           2.47×
CIFAR-10
CNN3       3   769.8           80      355.8           2.16×
           5   932.3           83      373.0           2.50×
CNN4       3   898.7           78      426.5           2.11×
           5   1086.2          81      446.5           2.43×
ResKAN18   3   6499.3          81      2977.4          2.18×
           5   7969.4          84      3184.2          2.50×

To quantify this effect on custom hardware, we synthesize a TPUv1-like [14] accelerator using a KAN-SAs-style weight-stationary systolic array [9] as its compute core on an SQRL Acorn CLE-215+ board featuring a Xilinx Artix-7 xc7a200t FPGA. All configurations use 8-bit weights and activations, G=5, P=3, and 256-entry B-spline lookup tables; only the B-spline table value bit-width varies. In the KAN-SAs architecture, each PE stores a local copy of the B-spline lookup table; reducing the bit-width of the stored values therefore shrinks each PE, freeing FPGA slices for additional PEs.
TABLE IV: FPGA resource utilization and average measured inference time over 5 MNIST-shaped MLP workloads [784,n,10] with n ∈ {32, 64, 128, 192, 256} (batch = 10 000) for varying B-spline table bit-width (G=5, P=3, 50 MHz).

B (bits)   Array size   Slice LUTs (%)   Slices (%)     C (cycles)   Time (ms)
8          15×15        113.7k / 85%     32.1k / 96%    5 206k       183.0
7          16×16        124.9k / 93%     33.3k / 100%   4 219k       156.4
5          18×18        119.7k / 89%     32.8k / 98%    3 618k       162.1
4          19×19        114.4k / 86%     33.2k / 99%    3 286k       157.9
3          20×20        115.2k / 86%     32.6k / 98%    2 969k       147.0

For each B-spline bit-width, the largest square systolic array that still fits within the FPGA's slice budget is selected (i.e., increasing the array dimensions by one beyond those reported causes the design to exceed the available slices and fail place-and-route). Note that in this TPUv1-like configuration, weight preload and matrix computation execute sequentially: the systolic array must be fully loaded with the current weight tile before input data can be streamed through. For single-sample inference, this per-tile preload overhead grows with the array dimensions and can offset the benefit of having more PEs, particularly when the layer dimensions are small relative to the array. For batched inference, however, the preload cost is amortized across all samples in the batch, making larger arrays strictly beneficial. For each bit-width B, Table IV reports the largest array dimension achieved, the FPGA resource utilization (Slice LUTs and Slices), the compute cycles (C), and the average measured inference time over the 5 MNIST-shaped MLP workloads on the full test set (10 000 samples). As shown, reducing the B-spline precision from 8 to 7 bits allows scaling the systolic array from 15×15 to 16×16, yielding a measured 1.17× average speedup.
Reducing the precision further to 3 bits enables a 20×20 array, achieving a measured 1.24× average speedup over the 8-bit baseline. The gap between estimated and measured speedup for larger arrays is due to DMA transfer overhead, which grows with the array dimensions due to memory bus alignment constraints.

TABLE V: Maximum clock frequency for a 16×16 KAN-SAs systolic array with varying B-spline table bit-width on FPGA (W=8, G=5, P=3).

B (bits)   Slice LUTs (%)   Max Freq (MHz)
8†         —                —
7          124.9k / 93%     50
6          118.1k / 88%     60
5          97.9k / 73%      70
4          85.2k / 64%      75
3          79.1k / 59%      75
†exceeds available slices

Beyond throughput, the area savings from B-spline quantization also enable higher clock frequencies. To isolate this effect, Table V reports the maximum clock frequency achievable for a fixed 16×16 array at each B-spline bit-width. The 8-bit configuration cannot complete place-and-route for a 16×16 array. Reducing the bit-width to 7 bits allows the design to fit, but routing congestion at 93% LUT utilization limits the clock to 50 MHz. Further quantization progressively reduces utilization and relieves routing pressure, enabling higher clock frequencies: 60 MHz at 6 bits, 70 MHz at 5 bits, and 75 MHz at 4 and 3 bits, a 50% improvement over the 7-bit baseline. Alternatively, the freed resources could be used for architectural enhancements such as double-buffered weight preload to overlap tiling overhead with computation, or deeper on-chip buffers to hide memory latency.

Figure 12: Accuracy vs spline table memory trade-off for spline tabulation. The graphs show the Pareto front of accuracy versus total spline table memory cost for different bit-widths of A^(l) (marker size) and stored-value precision h (marker shape).

To validate this, we run the 3-bit 16×16 array at 75 MHz and measure the inference time for the KAN MLP [784,64,10] on the full MNIST test set.
Compared to the 7-bit 16×16 baseline at 50 MHz, the measured time drops from 185 ms to 129.55 ms, a 1.43× speedup. Note that the achievable frequency remains limited by the high fanout of the control signals across the systolic array, as the KAN-SAs design targets ASIC and is not optimized for FPGA.

IV-B4 KAN-SAs Area and Frequency Scaling

To evaluate the impact of B-spline quantization on hardware independently of system-level effects, we synthesize the 16×16 KAN-SAs systolic array in isolation using Synopsys Design Compiler, targeting ST 28nm FD-SOI technology. Table VI reports the cell area and maximum achievable clock frequency for each configuration. Halving both the B-spline and weight precision from 16 to 8 bits reduces the cell area by 60% (from 1 426k to 574k μm² at 1 GHz) and raises the maximum frequency from 833 to 1 111 MHz. Further reducing only the B-spline precision from 8 to 3 bits yields an additional 30% area reduction (to 403k μm²) and increases the maximum frequency to 1 250 MHz, a 50% advantage over the 16-bit baseline. Overall, the combined reduction from B=16, W=16 to B=3, W=8 achieves a 72% area saving.

TABLE VI: ASIC synthesis results for a 16×16 KAN-SAs systolic array with varying B-spline and weight precision (ST 28nm FD-SOI, G=5, P=3).

B (bits)   W (bits)   Area @ 1 GHz (μm²)   Max frequency (MHz)   Area @ max freq (μm²)
16         16         1 425 590†           833                   1 479 594
8          8          574 102              1 111                 596 850
7          8          541 978              1 176                 581 353
6          8          515 865              1 176                 532 536
5          8          470 712              1 250                 489 225
4          8          433 345              1 333                 469 248
3          8          403 158              1 250                 434 962
†Timing violated at 1 GHz; area shown for comparison.

IV-C Spline Tabulation Space Exploration

In this subsection, we explore the space of possible quantized spline tabulation opportunities and show the effects in terms of the accuracy vs memory trade-off (Sec. IV-C1) and inference latency (Sec. IV-C2).
IV-C1 Accuracy vs Memory Trade-off of Spline Tabulation

Figure 12 shows the Pareto front of accuracy versus total spline table memory for all models. The vertical dashed line indicates the total FP32 coefficient memory, ∑_l N_in^(l) · N_out^(l) · (G+P) · 32 bits. Tabulating a whole KAN requires ∑_l N_in^(l) · N_out^(l) · 2^(bw_A^(l)) · h stored bits when using h-bit quantized spline values and 2^(bw_A^(l)) table entries, as already discussed in Section I-C for one layer. For the MNIST models (KANMLP1, KANMLP2, LeKAN), several Pareto-optimal configurations lie to the left of this line with less than 1% accuracy drop, meaning the spline tables require less storage than the coefficients they replace. Specifically, bw_A^(l) can be reduced to as few as 4 bits and h to 6 bits with negligible accuracy loss, consistent with prior studies [12]. For CIFAR-10, the sensitivity is model-dependent: CNN3, the smallest CIFAR-10 model, requires bw_A^(l) ≥ 7 bits to stay within ~1% of full precision, while the deeper ResKAN18 is more sensitive to h, needing h ≥ 7 bits. Overall, under per-tensor post-training quantization alone, the table memory required to maintain accuracy exceeds the original coefficient storage by a wide margin, making spline tabulation impractical for larger models without additional compression.

IV-C2 Latency Improvement

TABLE VII: GPU time (ms) on an RTX3090 for spline tabulation with 256 entries and 32-bit values, measured with torch.profiler. P=3 for all models.

Model      G   Baseline (ms)   Spline Tab. (ms)   Speedup
MNIST
KANMLP1    3   13.9            1.9                7.3×
           5   17.6            1.9                9.1×
KANMLP2    3   15.2            6.4                2.4×
           5   19.3            6.4                3.0×
LeKAN      3   763.3           201.0              3.8×
           5   923.9           200.8              4.6×
CIFAR-10
CNN3       3   769.8           1436.5             0.5×
           5   932.3           1448.0             0.6×
CNN4       3   898.7           6293.9             0.1×
           5   1086.2          6517.4             0.2×
ResKAN18   3   6499.3          5672.5             1.1×
           5   7969.4          5692.3             1.4×

Table VII reports the inference latency of spline tabulation compared to the baseline recursive B-spline-based execution. As discussed in Section I-C, spline tabulation replaces the entire KAN layer with N_in · N_out table lookups from the precomputed spline tables, followed by a summation along N_in to produce each output, eliminating both the Cox-de Boor recursion and the coefficient matrix multiply. Since each lookup contributes a single addition, the operation is inherently memory-bound, unlike the matrix multiply, which benefits from highly optimized GEMM implementations. Thus, spline tabulation yields a net speedup only if the time saved by avoiding the B-spline evaluation exceeds the cost of using memory-bound lookups instead of efficient matrix multiplication. For small models (KANMLP1, KANMLP2, LeKAN), spline tabulation yields speedups of 2.4–9.1× on GPU, as shown in Table VII: the spline tables remain reasonably small (see Figure 12), and, in the baseline version, the B-spline evaluation dominates the whole computation (see Table I, BSP% column). However, for larger models (CNN3, CNN4), the speedup drops significantly (down to 0.5× for CNN3 and 0.1× for CNN4), while ResKAN18 achieves a 1.4× speedup. Analyzing this phenomenon, we observe that the speedups depend on model depth and width: the baseline's recursive B-spline evaluation is sequential and scales poorly with depth (requiring multiple kernel runs per layer).
Conversely, increasing the layer width has little impact on its computation time, since GPUs can efficiently parallelize the element-wise operations across the width of each layer. Concerning the ConvKAN layer's spline tables, their indexing performs K²·C_in·C_out·H_out·W_out scattered memory reads. A deep model distributes reads across many independent kernel runs while maintaining good memory locality, whereas increasing the layer width degrades memory throughput by concentrating lookups into a single kernel. Therefore, ResKAN18 spends more time on B-splines due to its deeper architecture (20 layers), despite its narrower layers, as confirmed in Figure 13. CNN4's final layer (resolution 8×8) requires 64 accesses per table per sample, whereas ResKAN18's smaller layers (2×2) require only 4. In summary, spline tabulation is slower than the baseline recursive B-spline for CNN3 and CNN4 because the cumulative lookup cost exceeds the savings from the eliminated recursion.

Figure 13: CUDA kernel time breakdown for baseline (Cox-de Boor) vs spline tabulation on CNN3, CNN4, and ResKAN18 (G=3, P=3, RTX3090).

IV-C3 Scalability of Spline Tabulation on FPGA

Figure 14: Estimated FPGA LUT cost of spline tabulation [12] for the models studied in this paper. The dashed red line indicates the capacity of a Virtex UltraScale+ FPGA. Horizontal green lines report KAN-SAs systolic array [9] configurations, labeled by B-spline coefficient precision B, weight precision W, activation precision A, and array dimensions.

The N_in·N_out scaling of spline tabulation, which limits GPU speedups for larger models (as shown in Table VII), also poses a challenge for FPGA implementations. In an approach such as [12], each spline connection requires independent FPGA LUTs, leading to a total resource cost proportional to ∑_l N_in^(l) · N_out^(l).
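Under this scaling, a rough per-model LUT-cost estimate for a KAN MLP can be sketched as below; the per-connection cost is an assumed parameter (the text derives an empirical figure of about 99 FPGA LUTs per connection from the synthesis results of [12]):

```python
def est_spline_tab_luts(layer_dims, luts_per_connection=99):
    """Estimated FPGA LUT cost of full spline tabulation for a KAN MLP:
    an assumed per-connection cost times the total number of connections
    sum_l N_in^(l) * N_out^(l)."""
    connections = sum(a * b for a, b in zip(layer_dims, layer_dims[1:]))
    return luts_per_connection * connections

# KANMLP2 [784, 64, 10]: roughly 5M LUTs before any pruning or compression
cost = est_spline_tab_luts([784, 64, 10])
```

Even this small MLP lands in the millions of LUTs, which is why the text finds the approach viable only for limited model dimensions or with aggressive pruning.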
The work in [12] explored spline tabulation implementations on various FPGAs, including mid-range and high-end devices such as the Zynq UltraScale+ and Virtex UltraScale+. Figure 14 shows the estimated FPGA LUT cost for the models studied in this paper, using an empirical cost of approximately 99 FPGA LUTs per connection, derived from the synthesis results reported in the ablation study in [12] for 6-bit-addressed, 8-bit-valued spline tables on a Xilinx Virtex UltraScale+, a high-end FPGA family. The graph shows that the spline tabulation approach remains viable for models of limited dimensions, such as KANMLP1, KANMLP2, LeKAN, and CNN3. However, it does not scale to larger models, such as CNN4 and ResKAN18, exceeding the available resources of a high-end FPGA by orders of magnitude. Using aggressive pruning (~98%) combined with quantization-aware training via Brevitas [21], the work in [12] manages to reduce the cost of a KANMLP2 variant to 3809 LUTs on a Zynq UltraScale+, albeit with 1-bit input addressing and only after pruning and re-training. Whether such extreme pruning is feasible on more complex tasks and datasets remains to be determined, leaving open questions about the scalability of the spline tabulation approach. In contrast, KAN-SAs-like systolic-array accelerators [9] have a constant area footprint regardless of model size, as shown by the horizontal lines in Figure 14.

V Conclusion

Kolmogorov-Arnold Networks (KANs) have attracted much attention for their promise of better parameter efficiency and interpretability than Multi-Layer Perceptrons (MLPs). Nevertheless, their reliance on evaluating spline functions, often expressed as linear combinations of weighted basis splines (B-splines), poses a significant computational challenge. In this paper, we investigated the impact of low-bit quantization on KANs' accuracy, as well as its benefits in reducing computational complexity and improving hardware efficiency.
Our experiments revealed that B-splines are highly robust to quantization, down to 2-3 bits without loss of accuracy, yielding substantial reductions in computational complexity, whereas the learnable weights are more sensitive to quantization. Moreover, the computational overhead introduced by the recursive evaluation of B-splines can be reduced through B-spline tabulation. Hence, we evaluated the impact of quantization on tabulation and their joint benefits for computational complexity and hardware efficiency. Our experiments showed that a BitOps reduction of more than 50× is possible, e.g., for ResKAN18, without accuracy loss, thanks to quantized B-spline tabulation. Moreover, we obtained up to 2.9× inference speedup on GPUs thanks to B-spline tabulation, and up to 36% resource reduction, 50% higher clock frequency, and 1.24× inference speedup on FPGA thanks to quantized B-spline tabulation. On a 28nm FD-SOI ASIC, a 72% area reduction and 50% higher maximum frequency were obtained. Finally, we explored the opportunities and limitations of a similar state-of-the-art approach that tabulates the entire splines instead of the B-splines, highlighting its convenience compared to B-spline tabulation for models of limited dimension, whereas its scalability becomes challenging for larger models.

Acknowledgments

This work was supported by the French National Research Agency (ANR) through the RADYAL project ANR-23-IAS3-0002.

References

[1] S. Bouguezzi, H. Faiedh, and C. Souani (2021) Hardware Implementation of Tanh Exponential Activation Function using FPGA. In 18th International Multi-Conference on Systems, Signals & Devices (SSD). Cited by: §I-D.
[2] J. Braun and M. Griebel (2009) On a constructive proof of Kolmogorov's superposition theorem. Constructive Approximation 30. Cited by: §I-A.
[3] K. Chellapilla, S. Puri, and P. Simard (2006) High performance convolutional neural networks for document processing.
In IEEE International Workshop on Frontiers in Handwriting Recognition. Cited by: §I-B1.
[4] K. Chellapilla, S. Puri, and P. Simard (2006) High performance convolutional neural networks for document processing. In Tenth International Workshop on Frontiers in Handwriting Recognition. Cited by: §I-A.
[5] H. B. Curry and I. J. Schoenberg. On Pólya frequency functions IV: the fundamental spline functions and their limits. Cited by: §I-A.
[6] H. B. Curry and I. J. Schoenberg. On spline distributions and their limits: the Pólya distribution functions. Cited by: §I-A.
[7] C. de Boor (1971) Subroutine package for calculating with B-splines. Cited by: §I-A.
[8] I. Drokin (2024) Kolmogorov-Arnold convolutions: design principles and empirical studies. arXiv:2407.01092. Cited by: §I-A.
[9] S. Errabii, O. Sentieys, and M. Traiola (2026) KAN-SAs: efficient acceleration of Kolmogorov-Arnold networks on systolic arrays. In IEEE/ACM Design, Automation and Test in Europe Conference (DATE). Preprint: https://arxiv.org/abs/2512.00055. Cited by: §I, §I-B2, §I-B, Figure 14, §IV-B3, §IV-B, §IV-C3, §IV.
[10] K. A. A. Fuad and L. Chen (2025) QuantKAN: a unified quantization framework for Kolmogorov-Arnold networks. arXiv:2511.18689. Cited by: §I, §IV-A, §IV.
[11] Z. Hajduk and G. R. Dec (2023) Very High Accuracy Hyperbolic Tangent Function Implementation in FPGAs. IEEE Access 11. Cited by: §I-D.
[12] D. Hoang, A. Gupta, and P. C. Harris (2026) KANELÉ: Kolmogorov-Arnold networks for efficient LUT-based evaluation. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA). Cited by: §I, §I-B2, §I-B, §I-C, Figure 14, §IV-C1, §IV-C3, §IV.
[13] W. Huang, J. Jia, Y. Kong, F. Waqar, T. Wen, M. Chang, and S. Yu (2025) Hardware Acceleration of Kolmogorov-Arnold Network (KAN) for Lightweight Edge Inference. In 30th Asia and South Pacific Design Automation Conference. Cited by: §I, §I-B2, §I-B.
[14] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA). Cited by: §IV-B3.
[15] A. N. Kolmogorov (1956) On the representation of continuous functions of several variables as superpositions of continuous functions of a smaller number of variables. Dokl. Akad. Nauk 108(2). Cited by: §I-A.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11). Cited by: §IV.
[17] C. Li, X. Liu, W. Li, C. Wang, H. Liu, and Y. Yuan (2024) U-KAN makes strong backbone for medical image segmentation and generation. arXiv:2406.02918. Cited by: §I.
[18] L. Li (2024) X-KANeRF: KAN-based NeRF with various basis functions. https://github.com/lif314/X-KANeRF. Cited by: §I-A.
[19] Y. Li, X. Dong, and W. Wang (2020) Additive powers-of-two quantization: an efficient non-uniform discretization for neural networks. arXiv:1909.13144. Cited by: §I-B1.
[20] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark (2024) KAN: Kolmogorov-Arnold networks. arXiv:2404.19756. Cited by: §I, §I-A, §I-B1, §I-B, §IV.
[21] A. Pappalardo (2025) Xilinx/brevitas. Zenodo. https://doi.org/10.5281/zenodo.3333552. Cited by: §IV-C3.
[22] J. Park, K. Kim, and W. Shin (2024) CF-KAN: Kolmogorov-Arnold network-based collaborative filtering to mitigate catastrophic forgetting in recommender systems. arXiv:2409.05878. Cited by: §I.
[23] B. Pasca and M. Langhammer (2018) Activation function architectures for FPGAs. In 28th International Conference on Field Programmable Logic and Applications (FPL). Cited by: §I-D.
[24] F. Piazza, A. Uncini, and M. Zenobi (1993) Neural networks with digital LUT activation functions. In International Conference on Neural Networks (IJCNN), Vol. 2. Cited by: §I-D.
[25] T. A. Poggio, A. Banburski, and Q. Liao (2019) Theoretical issues in deep networks: approximation, optimization and generalization. CoRR abs/1908.09375. Cited by: §I-A.
[26] V. Starostin (2024) Convolutional KAN layer. https://github.com/StarostinV/convkan. Cited by: §I-A.
[27] C. Sudarshan, P. Manea, and J. P. Strachan (2025) A Kolmogorov-Arnold Compute-in-Memory (KA-CIM) Hardware Accelerator with High Energy Efficiency and Flexibility. Preprint 10.21203/rs.3.rs-5804189/v1. Cited by: §I.
[28] C. J. Vaca-Rubio, L. Blanco, R. Pereira, and M. Caus (2024) Kolmogorov-Arnold networks (KANs) for time series analysis. In 2024 IEEE Globecom Workshops (GC Wkshps). Cited by: §I.
[29] Y. Wu and M. T. Arafin (2025) ArKANe: accelerating Kolmogorov-Arnold networks on reconfigurable spatial architectures. IEEE Embedded Systems Letters. Cited by: §I, §I-B2.
[30] Y. Xie, A. N. Joseph Raj, Z. Hu, S. Huang, Z. Fan, and M. Joler (2020) A twofold lookup table architecture for efficient approximation of activation functions. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28(12). Cited by: §I-D.
[31] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2018) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160. Cited by: §I-B1.