
Paper deep dive

Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation

Yuxin Tang, Zhiyuan Xin, Zhimin Ding, Xinyu Yao, Daniel Bourgeois, Tirthak Patel, Chris Jermaine

Year: 2026 · Venue: arXiv preprint · Area: cs.MS · Type: Preprint · Embeddings: 78

Abstract

A tensor-relational computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity, while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case EinSum, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case EinSum so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.

Tags

ai-safety (imported, 100%) · csms (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. View the paper on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/13/2026, 12:57:02 AM

Summary

The paper introduces 'upper-case-lower-case EinSum', a tensor-relational notation that allows for the automatic decomposition of tensor computations into a hybrid of relational operations (for sparsity management) and high-performance numerical kernels (for dense computation). The authors propose the 'SparseEinSum' algorithm, which uses a cost model and dynamic programming to rewrite Einstein Summation expressions into this optimized tensor-relational form, demonstrating performance improvements in workloads like graph neural networks and quantum circuit simulation.

Entities (4)

Einstein Summation Notation · mathematical-notation · 99%
SparseEinSum · algorithm · 95%
upper-case-lower-case EinSum · formalism · 95%
Relational Database System · system · 90%

Relation Signals (3)

upper-case-lower-case EinSum leverages Relational Database System

confidence 95% · In this notation, indices written in upper case are handled relationally (i.e., they are 'promoted')

upper-case-lower-case EinSum uses Numerical Kernels

confidence 95% · indices written in lower case are handled by tensor indexing (i.e., they are 'demoted')... executed using efficient numerical kernels

SparseEinSum optimizes Einstein Summation Notation

confidence 90% · The SparseEinSum algorithm combines a simple cost model that evaluates the quality of an upper-case-lower-case EinSum expression under sparsity with a dynamic programming approach to optimize the rewrite.

Cypher Suggestions (2)

Find all algorithms proposed in the paper · confidence 90% · unvalidated

MATCH (a:Algorithm)-[:PROPOSED_IN]->(p:Paper {id: 'a7603410-0474-4748-bf2c-56b26bbc5e45'}) RETURN a.name

Map the relationship between formalisms and their execution targets · confidence 85% · unvalidated

MATCH (f:Formalism)-[:LEVERAGES]->(s:System) RETURN f.name, s.name

Full Text

77,431 characters extracted from source content.


Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation

Yuxin Tang, Zhiyuan Xin, Zhimin Ding, Xinyu Yao, Daniel Bourgeois, Tirthak Patel, Chris Jermaine
Rice University, Houston, Texas, USA
yt33,zx58,zd21,xy38,dcb10,tp53,cmj4@rice.edu

ABSTRACT

A tensor-relational computation is a relational computation where individual tuples carry vectors, matrices, or higher-dimensional arrays. An advantage of tensor-relational computation is that the overall computation can be executed on top of a relational system, inheriting the system's ability to automatically handle very large inputs with high levels of sparsity, while high-performance kernels (such as optimized matrix-matrix multiplication codes) can be used to perform most of the underlying mathematical operations. In this paper, we introduce upper-case-lower-case EinSum, which is a tensor-relational version of the classical Einstein Summation Notation. We study how to automatically rewrite a computation in Einstein Notation into upper-case-lower-case EinSum so that computationally intensive components are executed using efficient numerical kernels, while sparsity is managed relationally.

PVLDB Reference Format: Yuxin Tang, Zhiyuan Xin, Zhimin Ding, Xinyu Yao, Daniel Bourgeois, Tirthak Patel, Chris Jermaine. Automated Tensor-Relational Decomposition for Large-Scale Sparse Tensor Computation. PVLDB, 19(6): 1101-1114, 2026. doi:10.14778/3797919.3797921

PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/yuxineverforever/upper-case-lower-case-einstein-notation.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 19, No. 6, ISSN 2150-8097. doi:10.14778/3797919.3797921

1 INTRODUCTION

In machine learning, a tensor is a specialized relation, mapping a key specifying a position in a multidimensional array to a scalar. Given the close relationship between tensors and relations, almost all computations over tensors can be implemented on top of a relational database system. For example, consider the matrix multiplication chain $(X \times Y)^T \times Z$.

This multiplication is used to implement a message passing scheme in a graph neural network [41, 50]. That is, each column in X stores an embedding for a vertex, Y encodes the (directional) edges between vertices, and the goal is to sum all embeddings being passed over edges to a given vertex, and then to compute a linear transformation (represented by Z) over the sums at each vertex. If all input matrices are stored as (row, col, value) triples, the SQL corresponding to this computation is as follows:

SELECT XtimesY.row, Z.col,
       SUM(XtimesY.value * Z.value) AS value
FROM (SELECT X.row AS col, Y.col AS row,
             SUM(X.value * Y.value) AS value
      FROM X, Y
      WHERE X.col = Y.row
      GROUP BY X.row, Y.col) AS XtimesY, Z
WHERE XtimesY.col = Z.row
GROUP BY XtimesY.row, Z.col

Is the relational implementation effective? Unfortunately, this computation is unlikely to run well on a relational system. Imagine that a large, sparse graph represented by Y has 100 million vertices, each with 1000 neighbors, on average. The embeddings of each vertex (stored in X) are of size 8192, and the transformation Z results in a size 8192 vector. In this case, the first join of X and Y will produce approximately 820 trillion intermediate tuples, which are then aggregated down to 820 billion tuples. This result is then joined with Z, resulting in 6,000 trillion tuples, a debilitating number.

One could instead run this computation directly using tensors on top of a conventional deep learning system (for example, PyTorch [30]).
This would facilitate the use of accelerators such as GPUs, but the results are likely to be unsatisfactory. There is the problem that storing X requires 3.2TB of GPU RAM (at half precision). This requires 40 GPUs (assuming 80GB of storage each). Then there is the problem of computation. GPUs are not meant for sparse computation, and for very sparse matrices, even state-of-the-art, hand-tuned sparse-dense GPU kernels operate at around 0.1% compute utilization on an A100 GPU [42]. At 0.1% compute utilization, the $1.6 \times 10^{15}$ floating point operations required for the first matrix multiply would require more than an hour of A100 GPU time.

A sparse tensor-relational implementation. Instead, one could run this computation tensor-relationally [46], decomposing the problem so that a relational system can take advantage of the sparsity inherent in the computation, but efficient dense CPU (or GPU) kernels can be used where appropriate:

X (col INT, value VECTOR[8192])
Y (row INT, col INT, value DOUBLE)
Z (value MATRIX[8192, 8192])

SELECT XtimesY.row,
       vec_mat_mult(XtimesY.value, Z.value) AS value
FROM (SELECT Y.col AS row,
             SUM(vec_sca_mult(X.value, Y.value)) AS value
      FROM X, Y
      WHERE X.col = Y.row
      GROUP BY Y.col) AS XtimesY, Z

Kernels (vec_mat_mult() and vec_sca_mult()) will be very efficient, and because tuples are grouped into VECTOR and MATRIX structures, the number of intermediate tuples is radically reduced. The join of X and Y will now produce only 100 billion tuples, and the second join will produce only 100 billion tuples. Further, by running this computation on top of a relational system, memory and parallelization/distribution will be automatically managed. A distributed database system can automatically parallelize the computation across multiple machines, managing memory so that the "Out-Of-Memory" errors ubiquitous in modern machine learning programming are avoided.

Sparse tensor-relational decompositions via upper-case-lower-case EinSum.
In the above example, the choice was made to decompose X into row vectors, Y into scalars, and to leave Z undecomposed. The question we address in this paper is: how can we take an arbitrary tensor computation and decompose it into a tensor-relational computation where computationally intensive portions of the computation are handled using efficient kernels (such as vec_mat_mult()) but where sparsity is handled relationally?

To address this challenge, we propose a new tensor-relational variant of Einstein summation notation [4], referred to as upper-case-lower-case EinSum. Classic EinSum is a standard tensor calculus used to express machine learning and numerical computations. In upper-case-lower-case EinSum, expressions explicitly specify which parts of the computation are handled relationally and which parts are executed using efficient numerical kernels.

Given a directed acyclic graph of EinSum computations, the task of producing an equivalent, optimized tensor-relational computation can be reduced to automatically rewriting expressions into equivalent upper-case-lower-case EinSum expressions that maximize performance. In this paper, we propose an algorithm to perform this rewrite, called SparseEinSum. The SparseEinSum algorithm combines a simple cost model that evaluates the quality of an upper-case-lower-case EinSum expression under sparsity with a dynamic programming approach to optimize the rewrite.

We demonstrate experimentally that SparseEinSum produces high-performing tensor-relational computations. Our experimental testbed includes a variety of sparse tensor workloads, such as large-scale graph neural networks, graph-based attention computations, and quantum circuit simulation.
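The decomposed plan described above can be checked end-to-end at toy sizes. The following sketch is ours, not the authors' code: numpy stands in for both the relational engine and the kernels, all sizes and names are illustrative, and the dicts play the role of the VECTOR-carrying relations. The Y triples drive the join, vec_sca_mult plus the GROUP BY become a scaled-vector accumulation, and vec_mat_mult becomes a vector-matrix product.

```python
# A toy run of the tensor-relational plan for (X x Y)^T x Z.
import numpy as np

rng = np.random.default_rng(0)
n_vertices, emb = 6, 4
X = rng.standard_normal((emb, n_vertices))     # column v = embedding of vertex v
Y = np.zeros((n_vertices, n_vertices))
for r, c in [(0, 1), (2, 1), (3, 4), (5, 4)]:  # a few directed edges
    Y[r, c] = 1.0
Z = rng.standard_normal((emb, emb))

# Relational form: X as {col: vector}, Y as (row, col, value) triples, Z one tuple.
X_rel = {v: X[:, v] for v in range(n_vertices)}
Y_rel = [(r, c, Y[r, c]) for r in range(n_vertices)
         for c in range(n_vertices) if Y[r, c] != 0.0]

# Join X with Y on X.col = Y.row, then GROUP BY Y.col, SUM(vec_sca_mult(...)).
XtimesY = {}
for r, c, v in Y_rel:
    XtimesY[c] = XtimesY.get(c, np.zeros(emb)) + X_rel[r] * v

# Join with the single Z tuple: vec_mat_mult(XtimesY.value, Z.value).
result = {c: vec @ Z for c, vec in XtimesY.items()}

# Dense reference computation for comparison.
dense = (X @ Y).T @ Z
for c, vec in result.items():
    assert np.allclose(vec, dense[c])
```

Only vertices with incoming edges produce tuples, which is exactly the sparsity the relational side exploits.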
2 THE UPPER-CASE-LOWER-CASE EINSUM APPROACH

Assume we are given a (potentially) complex numerical computation specified as a directed acyclic graph (DAG) of EinSum expressions, and our goal is to produce an optimized computation that can be executed on top of a relational system.

EinSum is described in detail in Section 3. At a high level, it asks the programmer to specify a simple declarative expression describing how the entries of an output tensor are computed. For example, consider the matrix multiplication $W \leftarrow U \times V$. In EinSum, this computation can be expressed as:

$\forall i,k:\; W_{i,k} \leftarrow \sum_j U_{i,j} \times V_{j,k}$   (1)

By allowing arbitrary aggregation and scalar functions, as well as arbitrary indexing into the input tensors, the output tensor, and the aggregation domain, EinSum can express a very broad class of tensor computations. As a result, EinSum is now widely used in machine learning. Popular frameworks such as NumPy [20], JAX [18], PyTorch [30], and Dask [13] all provide implementations of EinSum. If it is unreasonable to require programmers to write EinSum expressions directly, they may instead specify computations using a PyTorch-like syntax, in which high-level operators (e.g., matmul) act as thin wrappers around the underlying EinSum representation.

There is a close relationship between EinSum and SQL/relational databases, in that every EinSum expression can be translated into SQL and implemented as a sequence of joins followed by an aggregation [5]. However, the standard EinSum-to-SQL translation implicitly assumes that all data are stored in fully relational form, with individual scalar values stored in tuples. If not handled carefully, such a decomposition can result in poor performance. To this end, we introduce the concept of upper-case-lower-case EinSum notation, which serves as the target representation for our SparseEinSum rewrite.
In this notation, indices written in upper case are handled relationally (i.e., they are "promoted"), whereas indices written in lower case are handled by tensor indexing (i.e., they are "demoted"). For example, consider the task of multiplying two 8192 × 8192 matrices. If we write Equation 1 as:

$\forall i,k:\; W_{i,k} \leftarrow \sum_J U_{i,J} \times V_{J,k}$

it corresponds to the following tensor-relational implementation:

U (J INT, valU VECTOR[8192])
V (J INT, valV VECTOR[8192])

SELECT SUM(outer_prod(U.valU, V.valV)) AS valW
FROM U, V
WHERE U.J = V.J

Note that since $J$ is the only index in upper-case notation, it is the only one to appear in the SQL code. The indices $i$ and $k$ are handled implicitly by the outer_prod() kernel function. The output tensor W is then stored as a single tuple holding the complete 8192 × 8192 output matrix. As an alternative, we can write:

$\forall I,K:\; W_{I,K} \leftarrow \sum_j U_{I,j} \times V_{j,K}$

This corresponds to the following:

U (I INT, valU VECTOR[8192])
V (K INT, valV VECTOR[8192])

SELECT U.I, V.K, inner_prod(U.valU, V.valV) AS valW
FROM U, V

Again, since $I$ and $K$ are handled relationally and $j$ is not, $j$ does not appear anywhere in the corresponding SQL; it is handled by the kernel function inner_prod(), which loops over the two vectors valU and valV along the $j$ dimension. The output matrix W is stored as a set of scalars indexed by $I$ and $K$.

Given the upper-case-lower-case notation, the core problem then becomes: how to rewrite a DAG of EinSum expressions into a DAG of upper-case-lower-case EinSum expressions that leverages a relational engine as appropriate to handle sparsity, and leverages high-performance kernels as appropriate to exploit density?

To this end, the remainder of the paper is organized as follows. The next section describes EinSum more formally. Then Section 4 describes the upper-case-lower-case notation and how it can take advantage of sparsity.
Section 6 describes how upper-case-lower-case EinSum expressions can be translated into SQL and thus executed on existing relational systems. Sections 7 and 8 describe the two major components of the SparseEinSum rewrite algorithm: a cost model for predicting the utility of a rewrite into upper-case-lower-case EinSum, and then a simple dynamic programming procedure for optimizing the cost of the rewrite. Experiments in Section 9 evaluate the utility of the approach.

3 EINSUM BACKGROUND

The first few paragraphs of this section, which introduce EinSum and notions such as bounds and label projections, are taken from prior work [8].

We first define the notion of a tensor. We use bold upper case (for example, U) to denote a tensor. Define the bound vector for U, denoted $b_U$, to be a vector of integers of length $r$, where $r$ stands for "rank". Matrices are rank-2 tensors. Next, define $\mathcal{I}(b_U)$ to be the set $\{0...b_U[0]-1\} \times \{0...b_U[1]-1\} \times ... \times \{0...b_U[r-1]-1\}$. This is the set of all indices or keys that obey the bound. A tensor U is then a function from $\mathcal{I}(b_U)$ to the set of real numbers.

Simple EinSum examples. We start with the classic example: matrix multiplication. Let U and V be matrices with bounds [100, 200] and [200, 50], respectively. Then matrix multiplication is written as:

$\forall i,k \in \mathcal{I}([100,50]):\; W_{i,k} \leftarrow \sum_{j \in \mathcal{I}([200])} U_{i,j} \times V_{j,k}$   (2)

We often drop the subscript on the aggregation as being unnecessary, as indices from the input tensors that are not used to index into the output tensor are, by definition, aggregated out. The bound vector for the output tensor is likewise implied by the bound vectors of the input, and can also be dropped.

This is an extended EinSum notation because we need not necessarily multiply items from the input tensors, and we need not necessarily use summation. For example, we can compute the maximum squared difference in each row between any two items in that row as: $W_i \leftarrow \max_j (U_{i,j} - V_{i,j})^2$.

General form of EinSum.
To describe the most general form of EinSum, we define the notion of a label, which is a symbol that can be bound to a value. We use $\ell_U$ to denote a list (vector) of labels used to index into tensor U.

We also need to define the projection and permutation of a bound vector. Given two lists of labels $\ell_1$ and $\ell_2$, and a bound vector $b$, define $b[\ell_1; \ell_2]$ to be a vector of length $|\ell_1|$, where the $i$th entry is $b[j]$ iff $\ell_1[i] = \ell_2[j]$. As an example, let $b = [2,3,4]$ and let $\ell_1 = [k,i]$ and $\ell_2 = [i,j,k]$. Then $b[\ell_1; \ell_2] = [4,2]$.

Given this, binary EinSum expressions take the general form:

$\forall \ell_W \in \mathcal{I}(b_W):\; W_{\ell_W} \leftarrow \bigoplus_{\ell_{agg} \in \mathcal{I}(b_{UV}[\ell_{agg}; \ell_{UV}])} \bigotimes\big(U_{\ell_U}, V_{\ell_V}\big)$   (3)

Here, $\bigoplus$ is the aggregation operator and $\bigotimes$ is the scalar function applied to joined values (EinSum is extended in that it allows for arbitrary $\bigoplus$ and $\bigotimes$ operations). In the above expression, we use $\ell_{UV}$ to denote the concatenation of the two label lists $\ell_U$ and $\ell_V$; $b_{UV}$ similarly denotes the concatenation of the two bound vectors.

Consider a more complicated EinSum expression. Assume that we have two tensors U and V with bound vectors $b_U = [10,100,20]$ and $b_V = [100,20,2000]$. We wish to transpose U to obtain a tensor with bound [20,10,100], then transpose V to obtain a new tensor with bound [20,100,2000], then do a batch matrix multiply of the two resulting tensors, and then sum out the batch dimension. In EinSum, this is expressed in the single expression:

$\forall i,k \in \mathcal{I}([10,2000]):\; W_{i,k} \leftarrow \sum_{b,j \in \mathcal{I}([20,100])} U_{i,j,b} \times V_{j,b,k}$

Considering the general form, we have $\ell_U = [i,j,b]$, $\ell_V = [j,b,k]$, $\ell_{agg} = [b,j]$, and $b_{UV} = [10,100,20,100,20,2000]$. The bound vector for the aggregation is computed as $b_{UV}[\ell_{agg}; \ell_{UV}]$. How? $\ell_{agg}$ has two labels: $b$ and $j$. As $b$ occupies the third (and fifth) position in $\ell_{UV}$, and $j$ occupies the second (and fourth) position in $\ell_{UV}$, we select the third (or fifth) item in $b_{UV}$, and the second (or fourth) item in $b_{UV}$. This results in the bound vector [20, 100].

Multi-Layer Perceptron in EinSum.
Using EinSum, we are able to specify very complex computations, including modern transformers, by composing individual EinSum operators together into a directed, acyclic graph of operators. For a simple example of such a DAG, the following specifies a fully-connected neural network with two weight matrices, $W^{(1)}_{i,j}$ and $W^{(2)}_{i,j}$, used to connect the input to a hidden layer, and then a softmax:

$A_i \leftarrow \sum_j W^{(1)}_{i,j} \times X_j$;  $B_i \leftarrow \sum_\emptyset \mathrm{ReLU}(A_i)$;  $C_i \leftarrow \sum_j W^{(2)}_{i,j} \times B_j$;  $D_\emptyset \leftarrow \sum_i \exp(C_i)$;  $E_i \leftarrow \sum_\emptyset \exp(C_i) / D_\emptyset$

4 SPARSE EINSUM

4.1 Why Worry About Sparsity?

EinSum is agnostic as to the implementation of the underlying tensors, but this implementation can have a significant practical effect on the efficiency of real-life calculations, especially if the underlying tensors are sparse. Consider the following two matrices:

U ≡ [ 1.4  2.2  0    2.1
      0    0    0    0
      1.4  0    1.1  0
      0    0    0    0 ]

V ≡ [ 3.2  0  1.3  0
      0    0  0.6  0
      0    0  1.2  0
      1.2  0  2.1  0 ]

If we store these matrices in a classic, dense format such as row-major or column-major, most classical implementations of the matrix multiplication specified by the EinSum $W_{i,k} \leftarrow \sum U_{i,j} \times V_{j,k}$ would perform 64 pairwise scalar multiplications. However, we could store these two matrices relationally, and have a much more efficient implementation of the matrix multiplication, at least in terms of the number of multiplications required. Consider representing U as a set of vectors, where any vectors containing only zero values are not represented explicitly:

Ū ≡ { (0, [1.4  2.2  0  2.1]), (2, [1.4  0  1.1  0]) }

Ū is implemented as a relation with schema (i, valU), where valU is of type vector[4]. V is implemented as a set of vectors:

V̄ ≡ { (0, [3.2  0  0  1.2]ᵀ), (2, [1.3  0.6  1.2  2.1]ᵀ) }   (4)

with schema (k, valV).
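The decomposition above can be exercised with a short numpy sketch (ours, not from the paper): build Ū from the non-zero rows of U and V̄ from the non-zero columns of V, then join the two relations with a plain dot product standing in for the dot_product() kernel.

```python
# Sketch of the Section 4.1 example: keep only the non-zero rows of U and the
# non-zero columns of V, then multiply via a join with a dot-product kernel.
import numpy as np

U = np.array([[1.4, 2.2, 0.0, 2.1],
              [0.0, 0.0, 0.0, 0.0],
              [1.4, 0.0, 1.1, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
V = np.array([[3.2, 0.0, 1.3, 0.0],
              [0.0, 0.0, 0.6, 0.0],
              [0.0, 0.0, 1.2, 0.0],
              [1.2, 0.0, 2.1, 0.0]])

# U_bar: (i, valU) over non-zero rows; V_bar: (k, valV) over non-zero columns.
U_bar = [(i, U[i, :]) for i in range(4) if U[i, :].any()]
V_bar = [(k, V[:, k]) for k in range(4) if V[:, k].any()]

# SELECT U.i, V.k, dot_product(U.valU, V.valV) AS valW FROM U_bar, V_bar
W = {(i, k): float(valU @ valV) for i, valU in U_bar for k, valV in V_bar}

# 2 x 2 joined tuples, 4 scalar multiplications per dot product: 16 in total,
# versus 64 for the dense triple loop.
mults = len(U_bar) * len(V_bar) * U.shape[1]
```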
The same matrix multiplication can now be implemented as a simple SQL query that computes a dot product over the cross product of the two relations:

SELECT U.i, V.k, dot_product(U.valU, V.valV) AS valW
FROM Ū AS U, V̄ AS V

Note that the result is a fully relational (fully decomposed) implementation of the resulting matrix, with schema (i, k, valW):

W ≡ { (0, 0, 7), (0, 2, 7.55), (2, 0, 4.48), (2, 2, 3.14) }

Crucially, only 16 pairwise scalar multiplications are required by this implementation, four per invocation of the dot_product() operation. Thus, only 25% as many scalar multiplications are required as in the traditional approach.

4.2 Promoting and Demoting Labels

For any relational implementation, there will be an overhead associated with pushing each tuple through the system, and this overhead may render the original matrix multiplication the best option. Still, there is clearly some level of sparsity for which this decomposed representation is preferred due to the lower computational load.

For any tensor, many decompositions into tensor relations of this form are possible. Consider an EinSum expression over tensor U with bound $b_U$ and label list $\ell_U$. We describe a decomposition of U by a partitioning of $\ell_U$ into two lists, $\uparrow\ell_U$ and $\downarrow\ell_U$:

(1) The labels in $\uparrow\ell_U$ are promoted so that they index into the resulting relation.
(2) The labels in $\downarrow\ell_U$ are demoted so that they index into the tensors within the relation.

The decomposition of U induced by $\uparrow\ell_U$ and $\downarrow\ell_U$ is a database relation with schema

Ū(INT $\uparrow\ell_U[1]$, INT $\uparrow\ell_U[2]$, ..., TENSOR valU)

In each tuple, valU is a tensor with bound $b_U[\downarrow\ell_U; \ell_U]$, and the $i$th attribute $\uparrow\ell_U[i]$ in the relation takes integer values from 0 through $b_U[\uparrow\ell_U; \ell_U][i] - 1$. Crucially, assuming that $\oplus$ and $\otimes$ form a semiring, if valU is filled only with the special zero value of the semiring, the corresponding tuple can be eliminated from the resulting tensor relation, facilitating sparsity.
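The promote/demote scheme can be sketched as a small generic helper (our illustration; `decompose` and its arguments are hypothetical names, not the paper's API): group the tensor's entries on the promoted axes, store the demoted axes inside the tuple's sub-tensor, and drop sub-tensors that are identically zero.

```python
# Sketch of the decomposition induced by a choice of promoted labels.
import numpy as np
from itertools import product

def decompose(T, labels, promoted):
    """Return {promoted-index tuple: sub-tensor} for tensor T with the given
    label list, dropping sub-tensors that are identically zero."""
    up = [labels.index(l) for l in promoted]   # axes handled relationally
    rel = {}
    for key in product(*(range(T.shape[a]) for a in up)):
        idx = [slice(None)] * T.ndim           # demoted axes stay inside valU
        for a, v in zip(up, key):
            idx[a] = v
        sub = T[tuple(idx)]
        if np.any(sub):                        # sparsity: skip all-zero tuples
            rel[key] = sub
    return rel

V = np.array([[3.2, 0.0, 1.3, 0.0],
              [0.0, 0.0, 0.6, 0.0],
              [0.0, 0.0, 1.2, 0.0],
              [1.2, 0.0, 2.1, 0.0]])

# Promote k (demote j): V_bar keeps only the two non-zero columns, 0 and 2.
V_bar = decompose(V, labels=["j", "k"], promoted=["k"])
```

Promoting every label recovers the fully relational (scalar-per-tuple) form; promoting none yields a single tuple holding the whole tensor.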
The set of all decompositions of a tensor U in an EinSum expression defines an equivalence class of all tensor-relational implementations for U in the expression, in that all of the tensor-relational implementations describe the same mapping from $\mathcal{I}(b_U)$ to the real numbers. Let Ū be a decomposition of U, defined by $\uparrow\ell_U$ and $\downarrow\ell_U$. Then, for any index $i \in \mathcal{I}(b_U)$, the following relational query returns a single tuple containing a scalar with the value at $U_i$:

SELECT ISNULL(valU[$i[\downarrow\ell_U; \ell_U]$], zero)
FROM (SELECT valU FROM Ū WHERE $\uparrow\ell_U$ = $i[\uparrow\ell_U; \ell_U]$)
     LEFT OUTER JOIN ZERO

Here, we assume a relation ZERO(zero) that contains a single tuple with the zero value for the semiring. The WHERE clause of the inner query uses the set of labels promoted to the relational part of the representation to select the tuple corresponding to $i$, and then the outer SELECT clause uses the remaining labels to index into the tensor valU to obtain the desired scalar. If the inner query returns no results, then valU will be NULL and the query result will be the zero value for the associated semiring.

For a concrete example of this, re-consider V, its (possible) tensor-relational representation V̄ from this section, and the EinSum $W_{i,k} \leftarrow \sum U_{i,j} \times V_{j,k}$. In this case, V̄ corresponds to the decomposition $\uparrow\ell_V = [k]$, $\downarrow\ell_V = [j]$. Consider the indexing vector $i = [3,2]$. In this case, $i[\downarrow\ell_V; \ell_V] = i[j; jk] = 3$ and $i[\uparrow\ell_V; \ell_V] = i[k; jk] = 2$. Thus, to obtain V[3,2] in V̄, we would use the query:

SELECT ISNULL(valV[3], zero)
FROM (SELECT valV FROM V̄ WHERE k = 2)
     LEFT OUTER JOIN ZERO

4.3 "Upper-Case-Lower-Case" Notation

Rather than explicitly listing $\uparrow\ell_U$ and $\downarrow\ell_U$, we suggest the convention that, in an EinSum expression over tensor relations, any labels in upper case are in $\uparrow\ell_U$, and any labels in lower case are in $\downarrow\ell_U$. This defines the upper-case-lower-case notation.
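The point lookup above can be mimicked in a few lines (our sketch; the dict stands in for the relation V̄, and the `None` fallback plays the role of the LEFT OUTER JOIN with ZERO):

```python
# Fetch V[3, 2] through the decomposition with promoted k and demoted j.
import numpy as np

V = np.array([[3.2, 0.0, 1.3, 0.0],
              [0.0, 0.0, 0.6, 0.0],
              [0.0, 0.0, 1.2, 0.0],
              [1.2, 0.0, 2.1, 0.0]])

# V_bar: promoted k -> column vector over j; all-zero columns are omitted.
V_bar = {k: V[:, k] for k in range(4) if V[:, k].any()}

def lookup(i, zero=0.0):
    """Return V[j, k]: select on the promoted part of i, then index the stored
    tensor with the demoted part; missing tuple means the semiring zero."""
    j, k = i
    tup = V_bar.get(k)        # WHERE k = i[k; jk]
    if tup is None:           # inner query empty: the sub-tensor was all zeros
        return zero
    return float(tup[j])      # valV[i[j; jk]]

value = lookup((3, 2))        # V[3, 2]
```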
So, consider the following EinSum expression for a tensor-relational matrix multiplication:

$W_{i,K} \leftarrow \sum U_{i,J} \times V_{J,K}$

In this case, labels $J$ and $K$ are used to index into relations, and label $i$ indexes into tensors. So this corresponds to the implementation where U is represented as the set of column vectors:

Ū ≡ { (0, [1.4  0  1.4  0]ᵀ), (1, [2.2  0  0  0]ᵀ), (2, [0  0  1.1  0]ᵀ), (3, [2.1  0  0  0]ᵀ) }

and V is represented as the fully decomposed set:

V̄ ≡ { (0, 0, 3.2), (0, 2, 1.3), (1, 2, 0.6), (2, 2, 1.2), (3, 0, 1.2), (3, 2, 2.1) }

while the output W is again represented as a set of column vectors, as in Equation 4.

For another example, consider the decomposition from the Introduction (the chain of two multiplications used to implement message passing in a graph neural network). Following the convention that those indexes promoted to a relational representation are given in upper case, this can be specified via the EinSum:

$T_{K,i} \leftarrow \sum X_{i,J} \times Y_{J,K}$;  $U_{I,k} \leftarrow \sum T_{I,j} \times Z_{j,k}$

5 DECOMPOSING EINSUM

We can decompose any binary EinSum expression into an equivalent computation over tensor relations. This equivalence gives a roadmap to implement tensor-relational EinSum over sparse inputs relationally. Using the convention of the last section, Ū and V̄ are used to denote the tensor-relational decompositions of U and V, respectively. Now, assume that we have an associated kernel function $\mathcal{K}$. This is a function that implements the EinSum:

$W_{\downarrow\ell_W} \leftarrow \bigoplus \bigotimes\big(U_{\downarrow\ell_U}, V_{\downarrow\ell_V}\big)$

Re-write Equation 3 as:

$W_{\ell_W} \leftarrow \bigoplus_{\downarrow\ell_{agg}, \uparrow\ell_{agg}} \bigotimes\Big(\big(\sigma_{\uparrow\ell_U}(\bar{U})\big)_{\downarrow\ell_U},\; \big(\sigma_{\uparrow\ell_V}(\bar{V})\big)_{\downarrow\ell_V}\Big)$   (5)

A bit of additional notation is introduced here.
Given a binding of the labels in $\uparrow\ell_U$ to values in the vector $i$ during the execution of the EinSum expression, $\sigma_{\uparrow\ell_U}(\bar{U})$ represents the application of a relational selection equivalent to running the query:

SELECT ISNULL(valU, zero)
FROM (SELECT valU FROM Ū WHERE $\uparrow\ell_U$ = i)
     LEFT OUTER JOIN ZERO

Note that zero is a tensor, filled with zeros, with bound $b[\downarrow\ell_U; \ell_U]$. Further, $\uparrow\ell_{agg}$ and $\downarrow\ell_{agg}$ represent the aggregation labels in the original expression partitioned into two lists, depending upon whether they appear in $\uparrow\ell_U, \uparrow\ell_V$ or in $\downarrow\ell_U, \downarrow\ell_V$. Given this, Equation 5 can be re-written as:

$W_{\ell_W} \leftarrow \bigoplus_{\uparrow\ell_{agg}} \Big( \big( \sigma_{\uparrow\ell_W, \uparrow\ell_{agg}} \big( \Join(\bar{U}, \bar{V}, valW \leftarrow \mathcal{K}(valU, valV)) \big) \big)_{\downarrow\ell_W} \Big)$

Here, $\Join(\bar{U}, \bar{V}, valW \leftarrow \mathcal{K}(.))$ performs a natural join of the relations Ū and V̄ and computes the additional attribute valW using the kernel function $\mathcal{K}$. If we define $\bigoplus_{\uparrow\ell_{agg}}$ relationally, so that it aggregates the tensors stored in valW by applying the $\oplus$ operator pairwise to each item in a pair of tensors, then this last expression can be re-written as:

$W_{\ell_W} \leftarrow \Big( \sigma_{\uparrow\ell_W} \Big( \bigoplus_{\uparrow\ell_{agg}} \big( \Join(\bar{U}, \bar{V}, valW \leftarrow \mathcal{K}(valU, valV)) \big) \Big) \Big)_{\downarrow\ell_W}$   (6)

This suggests a tensor-relational implementation for any binary EinSum expression over tensor relations:

SELECT $\uparrow\ell_W$, SUM(K(valU, valV)) AS valW
FROM Ū NATURAL JOIN V̄
GROUP BY $\uparrow\ell_W$

6 COMPILING EINSUM INTO TENSOR-RELATIONAL SQL

This suggests a framework for compiling a DAG of EinSum expressions into tensor-relational algebra or tensor-relational SQL.

(1) For each vertex (EinSum expression) in the DAG, re-write the EinSum into upper-case-lower-case EinSum.
(2) For each resulting expression, generate SQL according to Equation 6. That is, generate a SELECT-FROM-WHERE-GROUP BY query where (a) the attributes chosen in the SELECT are those labels in $\uparrow\ell_W$ as well as "SUM(K(valU, valV)) AS valW"; (b) the WHERE clause performs a natural join of the two input relations; (c) we group by those attributes corresponding to labels in $\uparrow\ell_W$.
(3) If there are mismatches in tensor-relational decomposition across decomposed EinSum expressions, generate tensor-relational algebra/tensor-relational SQL to change representation.

The last point bears some discussion. Imagine, for example, that we generated the following decomposition:

$T_{K,i} \leftarrow \sum X_{i,J} \times Y_{J,K}$
$U_{i,k} \leftarrow \sum T_{i,J} \times Z_{J,k}$

SQL for the first expression would be:

SELECT K, SUM(K1(valX, valY)) AS valT
FROM X̄ NATURAL JOIN Ȳ
GROUP BY K

SQL for the second expression would be:

SELECT SUM(K2(valT, valZ)) AS valU
FROM T̄ NATURAL JOIN Z̄

The output decomposition of T from the first EinSum (a set of row vectors, as the first [row] index is upper case) does not match the input decomposition of T to the second EinSum (a set of column vectors, as the second [column] index is upper case). Thus, code must be generated to repartition T between the operations.

Assume the output of the first EinSum expression, partitioned according to $T_{K,i}$, is named T̄_in, and we want to repartition according to $T_{i,J}$ to form T̄. We would first decompose in the first dimension:

CREATE VIEW T̄_int (i, j, valT) AS
SELECT T̄_in.k, A.index, T̄_in.valT[A.index]
FROM ALLINTS(0, b_T[1]) AS A, T̄_in

In this code, T̄_in.valT[A.index] extracts the sub-tensor (here a scalar) at position A.index from T̄_in.valT. ALLINTS is a table function returning all values in the specified range. We then re-compose on the second dimension:

CREATE VIEW T̄ (j, valT) AS
SELECT T̄_int.j, STACK(T̄_int.valT, T̄_int.i, 0, b_T[0])
FROM T̄_int
GROUP BY T̄_int.j

Here, STACK is an aggregation function that can be used to stack scalars (or tensors) to create a higher-dimensional tensor. It accepts: (1) the value to stack, (2) the position of this value in the higher-dimensional tensor, (3) the new dimension to create, and (4) the maximum value of this dimension.

In the general case, the repartition from tensor relation T̄_in to T̄ happens in two steps.
First, there is a decomposition of the result of the first EinSum expression so that it is partitioned on all of the labels in either $\uparrow\ell_{T_{in}}$ or $\uparrow\ell_T$. Then there is an "un-partitioning" or aggregation to obtain T̄. Note that if $\uparrow\ell_{T_{in}}$ is a subset of $\uparrow\ell_T$, the aggregation can be skipped, and if $\uparrow\ell_T$ is a subset of $\uparrow\ell_{T_{in}}$, the decomposition can be skipped. If $\uparrow\ell_{T_{in}} = \uparrow\ell_T$, then the entire repartition can be skipped, as the partitionings are the same.

7 THE SPARSE EINSUM COST MODEL

This leaves two final questions, as our goal is to find the best decomposition. (1) How do we cost a given decomposition in a sparsity-aware fashion? (2) How do we efficiently search the space of all decompositions? In this section, we consider the first question.

7.1 Estimating Sizes Under Sparsity

When developing a cost model, the first question we must address is: given a (potentially sparse) tensor U with label set $\ell_U$ and a tensor-relational decomposition Ū defined by $\uparrow\ell_U$ and $\downarrow\ell_U$, how many tuples are in Ū? That is, if we take a tensor and decompose it into a tensor relation, can we estimate how many tuples will be contained in the relation? We define this quantity using the notation $T(\bar{U})$.

We begin by defining two statistics that are sufficient to accurately estimate $T(\bar{U})$. First, for a label $l$ in $\ell_U$, let $V(l, U)$ denote the number of sub-tensors induced by $l$ that have at least one non-zero entry, if we were to use $\uparrow\ell_U = [l]$. In other words, $V(l, U)$ is the answer to the question: if we were to decompose U by representing label $l$ relationally, how many tuples would we have? To be precise, let $\ell'_U$ denote $\ell_U$ with $l$ removed. Then $V(l, U)$ can be computed using the following EinSum expressions:

$V_l \leftarrow \sum_{\ell'_U} \mathrm{if}(U_{\ell_U} == 0,\, 0,\, 1)$
$V(l, U) \leftarrow \sum_l \mathrm{if}(V_l == 0,\, 0,\, 1)$

Here, the function "if(bool, val1, val2)" returns val1 if bool is true, and val2 otherwise.
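The $V(l, U)$ statistic can be computed directly with numpy (our sketch, equivalent to the two EinSum expressions above): reduce away the other labels, then count the slices that retain at least one non-zero entry.

```python
# V(l, U): the number of sub-tensors induced by promoting label l alone that
# contain at least one non-zero entry.
import numpy as np

def V_stat(U, axis):
    """Count slices of U along `axis` that have at least one non-zero entry."""
    other = tuple(a for a in range(U.ndim) if a != axis)
    per_slice = (U != 0).sum(axis=other)   # V_l <- sum of if(U == 0, 0, 1)
    return int((per_slice != 0).sum())     # count of slices with V_l != 0

U = np.array([[1.4, 2.2, 0.0, 2.1],
              [0.0, 0.0, 0.0, 0.0],
              [1.4, 0.0, 1.1, 0.0],
              [0.0, 0.0, 0.0, 0.0]])

v_rows = V_stat(U, axis=0)   # rows 0 and 2 have a non-zero entry -> 2
v_cols = V_stat(U, axis=1)   # every column has a non-zero entry -> 4
```

On the running example from Section 4.1, promoting the row label yields 2 tuples, matching the two-tuple relation Ū shown there.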
Note that $V(\mathbf{U}, l)$ is in some sense equivalent to the notion of the number of distinct values for an attribute that is commonly used in relational query optimization. Why? Imagine that $\mathbf{U}$ were represented purely relationally—that is, we perform a tensor-relational decomposition of $\mathbf{U}$ into $\bar{\mathbf{U}}$ so that $\uparrow\!\ell_U = \ell_U$ and $\downarrow\!\ell_U = \emptyset$, and remove any tuples storing the scalar value zero. Now consider attribute $l$. If $l$ has $d$ distinct values in $\bar{\mathbf{U}}$, it means that were we to group all of the tuples having the same $l$ value into a sub-tensor of $\mathbf{U}$, there would be exactly $d$ sub-tensors, and none of those would be composed entirely of zeros (any sub-tensor composed purely of zeros would not contribute any tuples to $\bar{\mathbf{U}}$, and hence would not contribute to the count $d$). Thus, $V(\mathbf{U}, l)$ would be $d$.

Further, assume we have $T(\mathbf{U})$, the number of non-zero entries in tensor $\mathbf{U}$. Again, this is analogous to the number of tuples in a relation: if we perform a pure relational decomposition of $\mathbf{U}$ and remove tuples storing zero values, the number of non-zero entries in $\mathbf{U}$ is exactly the number of tuples in that pure relational decomposition.

Let $n(\uparrow\!\ell_U)$ denote the number of possible tuples in some tensor-relational decomposition $\bar{\mathbf{U}}$ of $\mathbf{U}$, induced by the promoted label set $\uparrow\!\ell_U$. This can be computed as:

$$n(\uparrow\!\ell_U) = \prod_{l \in \uparrow\ell_U} V(\mathbf{U}, l)$$

Then the number of tuples in the tensor-relational decomposition $\bar{\mathbf{U}}$ induced by the set of promoted labels $\uparrow\!\ell_U$ can be estimated as:

$$T(\uparrow\!\ell_U) = n(\uparrow\!\ell_U) \left(1 - \exp\left(-\frac{T(\mathbf{U})}{n(\uparrow\!\ell_U)}\right)\right) \tag{7}$$

The intuition is that there are $n(\uparrow\!\ell_U)$ possible tuples in the tensor relation $\bar{\mathbf{U}}$. $T(\uparrow\!\ell_U)$ estimates the number of tuples that have tensors with at least one non-zero value, if the $T(\mathbf{U})$ non-zero values are spread uniformly at random throughout the tensor relation.

7.2 The Cost Model

When implementing a generic, binary EinSum expression of the form of Equation 3 tensor-relationally, there are two operations, a join and an aggregation.
There is a third operation—repartition—that may be necessary to change tensor-relational decompositions between the output of one operation and the input of the next. We consider how to cost each of them.

Costing the join. The set of labels participating in the relational join, $\ell_{join}$, is the list of labels in both $\uparrow\!\ell_U$ and $\uparrow\!\ell_V$. The number of tuples resulting from the tensor-relational join can be estimated as:

$$T_{join}(\uparrow\!\ell_U, \uparrow\!\ell_V) = \frac{T(\uparrow\!\ell_U)\, T(\uparrow\!\ell_V)}{\prod_{l \in \ell_{join}} \max(V(\mathbf{U}, l), V(\mathbf{V}, l))}$$

This is an application of the standard, textbook estimator for the number of tuples resulting from a join. Note that if there are no labels common to $\uparrow\!\ell_U$ and $\uparrow\!\ell_V$, so that $\ell_{join}$ is empty, the denominator takes the value 1. Then the cost of the tensor-relational join is estimated as:

$$C_{join}(\uparrow\!\ell_U, \uparrow\!\ell_V) = T_{join}(\uparrow\!\ell_U, \uparrow\!\ell_V) \times \left[(\mathrm{sizeof}(\bar{\mathbf{U}}) + \mathrm{sizeof}(\bar{\mathbf{V}}))\, C_{xfer} + C_{krnel} + C_{fixed}\right]$$

Here, sizeof computes the size of a tuple from the tensor relation, in bytes; $C_{xfer}$ is the per-byte cost to move a tuple from machine to machine (assuming a distributed environment); $C_{krnel}$ is the cost to invoke the kernel $\mathcal{K}$ (typically the number of floating-point operations required by $\mathcal{K}$, times a constant); and $C_{fixed}$ is a fixed, per-tuple processing cost. Overall, the cost is then: (a) the cost to move tuples from both inputs into the join; plus (b) the cost to run the kernel; plus (c) a fixed cost per tuple.

A rule of thumb is to choose $C_{xfer}$ by checking how long it takes to transfer a byte from machine to machine (or from disk to machine in a single-machine environment); to choose $C_{krnel}$ by checking how long it takes to run a floating-point computation on the execution platform; and to choose $C_{fixed}$ by running a single-machine computation that produces and aggregates a large number of tuples (say, by running a cross product of two relations and aggregating the result) and dividing the runtime by the number of tuples produced, to obtain the per-tuple overhead.
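A minimal sketch of the size estimator of Equation 7 and the join-cardinality estimator, with hypothetical statistics (illustrative only; not the paper's implementation):

```python
import math

def n_possible(V_stats, promoted):
    """n(up-ell_U): product of the per-label distinct counts V(U, l)."""
    out = 1
    for l in promoted:
        out *= V_stats[l]
    return out

def est_tuples(nnz, V_stats, promoted):
    """Equation (7): expected number of non-empty tuples when nnz
    non-zeros are spread uniformly over n(up-ell_U) possible tuples."""
    n = n_possible(V_stats, promoted)
    return n * (1.0 - math.exp(-nnz / n))

def est_join_tuples(TU, TV, VU, VV, promU, promV):
    """Textbook join-cardinality estimate for the tensor-relational join."""
    denom = 1
    for l in set(promU) & set(promV):
        denom *= max(VU[l], VV[l])   # empty intersection => denom stays 1
    return TU * TV / denom

# Hypothetical sparse matrix U with labels i, j: 1,000 non-zeros,
# 100 non-empty rows, 200 non-empty columns.
VU = {"i": 100, "j": 200}
T_fully_relational = est_tuples(1000, VU, ["i", "j"])  # most tuples survive
T_rows_only = est_tuples(1000, VU, ["i"])              # saturates near 100
print(round(T_fully_relational), round(T_rows_only))
```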
Costing the aggregation. The number of tuples resulting from the subsequent aggregation can be estimated as:

$$T_{agg}(\uparrow\!\ell_U, \uparrow\!\ell_V) = \min\left(\frac{1}{2}\, T_{join}(\uparrow\!\ell_U, \uparrow\!\ell_V),\ \prod_{l \in \uparrow\ell_{agg}} \begin{cases} \min(V(\mathbf{U}, l), V(\mathbf{V}, l)) & \text{if } l \in \uparrow\!\ell_U \wedge l \in \uparrow\!\ell_V \\ V(\mathbf{U}, l) & \text{else if } l \in \uparrow\!\ell_U \\ V(\mathbf{V}, l) & \text{else} \end{cases}\right)$$

This is a variation on the standard, textbook estimator for the number of tuples resulting from an aggregation. Note the three cases in the second option of the "min". The number of distinct values for each promoted grouping label is: (a) the minimum of the number of distinct values from both $\mathbf{U}$ and $\mathbf{V}$, if the label is a join attribute; (b) the number of distinct values from $\mathbf{U}$, if the label comes only from $\mathbf{U}$; or (c) the number of distinct values from $\mathbf{V}$, otherwise. Then the cost of the aggregation is estimated as:

$$C_{agg}(\uparrow\!\ell_U, \uparrow\!\ell_V) = \left(T_{join}(\uparrow\!\ell_U, \uparrow\!\ell_V) - T_{agg}(\uparrow\!\ell_U, \uparrow\!\ell_V)\right) \times \left[\mathrm{sizeof}(\bar{\mathbf{W}})\, C_{xfer} + C_{add} + C_{fixed}\right]$$

This formula mirrors the formula for $C_{join}$, the one difference being that the output of the aggregation is $T_{agg}(\uparrow\!\ell_U, \uparrow\!\ell_V)$ "aggregation groups"; reducing the $T_{join}/T_{agg}$ tuples in an aggregation group down to a single tuple requires moving and adding $T_{join}/T_{agg} - 1$ tuples per group, or $T_{join}(\uparrow\!\ell_U, \uparrow\!\ell_V) - T_{agg}(\uparrow\!\ell_U, \uparrow\!\ell_V)$ moves-and-adds overall.

Costing the repartition. Finally, there is the cost to repartition tensor relation $\bar{\mathbf{W}}$ when the output partition of the expression producing $\bar{\mathbf{W}}$ does not match the input to the consumer of $\bar{\mathbf{W}}$. Let $\uparrow\!\ell_{W_{in}}$ be the set of promoted labels in the input to the repartition, and let $\uparrow\!\ell_W$ be the set of promoted labels in the target, consuming EinSum expression.
As illustrated in the SQL given in the prior section, the typical implementation of a repartition first decomposes the result of the first EinSum expression so that it is partitioned on all of the labels in either $\uparrow\!\ell_{W_{in}}$ or $\uparrow\!\ell_W$ (while both $\uparrow\!\ell_{W_{in}}$ and $\uparrow\!\ell_W$ are lists and not sets, we abuse notation and denote the list containing all labels in both as $\uparrow\!\ell_{W_{in}} \cup \uparrow\!\ell_W$). The number of tuples in the resulting tensor relation can be estimated using $T(\uparrow\!\ell_{W_{in}} \cup \uparrow\!\ell_W)$. Note that this quantity cannot be less than either $T(\uparrow\!\ell_W)$ or $T(\uparrow\!\ell_{W_{in}})$, as the intermediate decomposition uses a set of promoted labels that is a superset of both the input and output sets of promoted labels. Then it is possible to compute an expression for the cost $C_{repart}$:

$$C_{repart}(\uparrow\!\ell_{W_{in}}, \uparrow\!\ell_W) = \left(T(\uparrow\!\ell_{W_{in}} \cup \uparrow\!\ell_W) - T(\uparrow\!\ell_{W_{in}})\right)\left(\mathrm{sizeof}(\bar{\mathbf{W}}_{int})\, C_{xfer} + C_{fixed}\right) + \left(T(\uparrow\!\ell_{W_{in}} \cup \uparrow\!\ell_W) - T(\uparrow\!\ell_W)\right)\left(\mathrm{sizeof}(\bar{\mathbf{W}}_{int})\, C_{xfer} + C_{fixed}\right)$$
$$= \left(2\,T(\uparrow\!\ell_{W_{in}} \cup \uparrow\!\ell_W) - T(\uparrow\!\ell_{W_{in}}) - T(\uparrow\!\ell_W)\right) \times \left(\mathrm{sizeof}(\bar{\mathbf{W}}_{int})\, C_{xfer} + C_{fixed}\right)$$

$C_{repart}$ has two parts. First is the cost of creating and processing the $T(\uparrow\!\ell_{W_{in}} \cup \uparrow\!\ell_W) - T(\uparrow\!\ell_{W_{in}})$ new tuples during the first, decomposition step. And second is aggregating those tuples into $T(\uparrow\!\ell_W)$ inputs to the next operation; this requires moving around $T(\uparrow\!\ell_{W_{in}} \cup \uparrow\!\ell_W) - T(\uparrow\!\ell_W)$ intermediate tuples.

8 OPTIMIZING THE DECOMPOSITION

We now define a dynamic programming algorithm that computes the lowest-cost, sparsity-aware decomposition of a DAG of EinSum expressions. The algorithm re-writes each expression in the DAG into upper-case-lower-case notation. Combined with the results from Section 5, this allows such an EinSum program to be compiled into a high-performance, tensor-relational computation.
8.1 Dynamic Programming

Consider a directed graph $G = \langle V, E \rangle$, where each vertex in the vertex set $V$ represents an EinSum operation. $V$ has two types of vertices: (1) those with no inputs (these correspond to input tensors) and (2) those with two inputs (these correspond to binary EinSum expressions). For our optimization algorithm, the important features of each EinSum operation are the sets of labels $\ell_U$, $\ell_V$, $\ell_W$ present in the expression, as well as the bound vectors of the input and output tensors. These allow us to make use of the cost models from the previous section.

Each directed edge in $E$ is a triple consisting of (1) the vertex/EinSum expression producing a tensor, (2) the vertex/EinSum expression consuming that same tensor, and (3) a constant that is either left or right, indicating whether this tensor is the left or right input to the consuming vertex.

Algorithm 1 ComputeMinCostForTree
1: topologically sort the vertices of the input EinSum graph G
2: initialize the lookup cost table C so that for any vertex v with no inputs, C[v, ↑ℓ_W] = 0 for all ↑ℓ_W ⊆ ℓ_W, where ℓ_W is the list of labels for the tensor W produced by v
3: for each vertex v ∈ V, in sorted order do
4:   % input tensors always have cost 0
5:   if v has no inputs then continue
6:   find v_U s.t. (v_U, v, left) ∈ E
7:   find v_V s.t. (v_V, v, right) ∈ E
8:   extract ℓ_U, ℓ_V, ℓ_W from vertex v
9:   % here we consider every output decomposition for W
10:  for each subset ↑ℓ_W ⊆ ℓ_W do
11:    cost_min ← ∞
12:    % ↑ℓ_U is consistent with ↑ℓ_W if, for each l in (ℓ_U ∩ ℓ_W),
13:    % l is in ↑ℓ_U iff it is also in ↑ℓ_W
14:    for each subset ↑ℓ_U ⊆ ℓ_U that is consistent with ↑ℓ_W do
15:      cost_U ← min over ↑ℓ_{U_in} of C_repart(↑ℓ_{U_in}, ↑ℓ_U) + C[v_U, ↑ℓ_{U_in}]
16:      for each subset ↑ℓ_V ⊆ ℓ_V that is consistent with ↑ℓ_W do
17:        cost_V ← min over ↑ℓ_{V_in} of C_repart(↑ℓ_{V_in}, ↑ℓ_V) + C[v_V, ↑ℓ_{V_in}]
18:        cost_W ← cost_U + cost_V + C_join(↑ℓ_U, ↑ℓ_V) + C_agg(↑ℓ_U, ↑ℓ_V)
19:        if cost_W < cost_min then cost_min ← cost_W
20:      end for
21:    end for
22:    % remember the lowest cost to compute W via decomposition ↑ℓ_W
23:    C[v, ↑ℓ_W] ← cost_min
24:  end for
25: end for
26: let v be the last vertex in V
27: return min over ↑ℓ_W of C[v, ↑ℓ_W]

The dynamic programming algorithm works its way through the graph, considering vertices in topologically-sorted order. As it does so, it populates a lookup table $C$. For a vertex $v$ producing a tensor with label list $\ell_W$, $C[v, \uparrow\!\ell_W]$ will be computed for every possible set of promoted labels $\uparrow\!\ell_W$—that is, for every subset of $\ell_W$. $C[v, \uparrow\!\ell_W]$ stores the minimum possible cost to produce tensor $\mathbf{W}$—considering all of the EinSum expressions necessary to compute $\mathbf{W}$—subject to the constraint that the tensor-relational decomposition of the output tensor is described by $\uparrow\!\ell_W$.

It is possible to populate $C$ in a single pass through the graph via a dynamic programming algorithm because, as long as each tensor is consumed by exactly one EinSum expression, we do not care about the details of how the output associated with a vertex $v$ is computed; we only care about (a) the output decomposition resulting from $v$, and (b) the cost to compute $v$.
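To make the memoization idea concrete, the toy sketch below implements the same scheme for a graph of binary expressions, with the cost functions abstracted away as parameters (left/right edge tags and the actual cost model of Section 7 are omitted; this is an illustration, not the paper's implementation):

```python
from itertools import chain, combinations

def subsets(labels):
    """All subsets of a label list: the candidate promoted label sets."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(labels, r) for r in range(len(labels) + 1))]

def consistent(prom, prom_out, shared):
    """prom agrees with prom_out on every shared label."""
    return all((l in prom) == (l in prom_out) for l in shared)

def min_cost_tree(vertices, edges, labels, op_cost, repart_cost):
    """Toy analogue of ComputeMinCostForTree. `vertices` is topologically
    sorted; leaves have no incoming edge; inner vertices have exactly two.
    C[v][prom] memoizes the cheapest way to produce v's tensor partitioned
    according to the promoted label set `prom`."""
    C = {}
    for v in vertices:
        ins = [src for (src, dst) in edges if dst == v]
        if not ins:
            # Input tensor: any starting decomposition is free here.
            C[v] = {p: 0.0 for p in subsets(labels[v])}
            continue
        vU, vV = ins
        C[v] = {}
        for promW in subsets(labels[v]):
            best = float("inf")
            for promU in subsets(labels[vU]):
                if not consistent(promU, promW,
                                  set(labels[vU]) & set(labels[v])):
                    continue
                # Cheapest way to obtain the left input, repartitioned.
                cU = min(repart_cost(pin, promU) + c
                         for pin, c in C[vU].items())
                for promV in subsets(labels[vV]):
                    if not consistent(promV, promW,
                                      set(labels[vV]) & set(labels[v])):
                        continue
                    cV = min(repart_cost(pin, promV) + c
                             for pin, c in C[vV].items())
                    best = min(best, cU + cV + op_cost(promU, promV, promW))
            C[v][promW] = best
    return min(C[vertices[-1]].values())

# A two-input tree: W(i,k) consumes A(i,j) and B(j,k).
labels = {"A": ["i", "j"], "B": ["j", "k"], "W": ["i", "k"]}
edges = [("A", "W"), ("B", "W")]
cost = min_cost_tree(["A", "B", "W"], edges, labels,
                     op_cost=lambda pu, pv, pw: 1.0,
                     repart_cost=lambda pin, pout: 0.0 if pin == pout else 1.0)
print(cost)  # 1.0: one operation, no repartition needed
```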
As long as $C$ stores the lowest cost to compute all decompositions of the EinSum expression corresponding to vertex $v$, to optimize the decomposition of a consumer of $v$ we need only look at the entries of the form $C[v, \cdot]$.

The dynamic programming algorithm is given as Algorithm 1. For each vertex $v$ there is a triply-nested loop. At the outer level, we compute the optimal cost for every possible tensor-relational decomposition of the output of $v$, described by the set of promoted labels $\uparrow\!\ell_W$. The goal is to compute the lowest cost to produce $\uparrow\!\ell_W$ for $v$. The two inner loops consider all possible input decompositions that are consistent with $\uparrow\!\ell_W$. If $\uparrow\!\ell_W$ promotes a label $l$ so that it is handled relationally, then both inputs must promote that label. By considering all of these consistent input combinations, the optimal value for $C[v, \uparrow\!\ell_W]$ is computed.

8.2 Additional Considerations

Pre-computing the required statistics. In practice, running Algorithm 1 requires that for each tensor $\mathbf{U}$ we have $T(\mathbf{U})$ (the number of non-zero entries in $\mathbf{U}$) and $V(\mathbf{U}, l)$ for each label $l$ (the number of non-zero sub-tensors induced by label $l$). These statistics can be obtained by running the GetStats algorithm (Algorithm 2).

Algorithm 2 GetStats
1: compile the input EinSum DAG into SQL over relations storing only scalars—that is, let ↑ℓ_U = ℓ_U and ↓ℓ_U = ∅ for each tensor U—and compute statistics assuming that all tuples storing zero-valued scalars are removed from the resulting relations
2: compile the purely-relational SQL into relational algebra using classical methods
3: cost the relational algebra using classical methods, to compute/estimate the number of tuples for each input and intermediate relation; for relation Ū corresponding to tensor U, let T(U) denote this tuple count
4: cost the relational algebra using classical methods, to compute/estimate the number of distinct values for each attribute; for attribute l and relation Ū, let V(U, l) denote this value count
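The fully-relational view of a tensor that GetStats relies on can be illustrated with exact counts over a toy scalar relation (using SQLite here purely for concreteness; the paper instead estimates these quantities with classical optimizer costing methods rather than exact queries):

```python
import sqlite3

# Store a sparse matrix U purely relationally: one tuple per non-zero,
# with the promoted labels i and j as ordinary integer attributes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE U (i INTEGER, j INTEGER, val REAL)")
conn.executemany("INSERT INTO U VALUES (?, ?, ?)",
                 [(0, 0, 1.0), (0, 1, 2.0), (2, 0, 3.0), (3, 1, 4.0)])

# T(U): the tuple count of the zero-suppressed relation.
t_u = conn.execute("SELECT COUNT(*) FROM U").fetchone()[0]

# V(U, l): the number of distinct values of each attribute, which equals
# the number of sub-tensors with at least one non-zero entry.
v_i = conn.execute("SELECT COUNT(DISTINCT i) FROM U").fetchone()[0]
v_j = conn.execute("SELECT COUNT(DISTINCT j) FROM U").fetchone()[0]

print(t_u, v_i, v_j)  # 4 non-zeros, 3 non-empty rows, 2 non-empty columns
```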
At the highest level, this algorithm relies on generating SQL corresponding to a fully-relational version of the input EinSum program, and then using classical relational costing methods to compute the required statistics.

When results are consumed more than once. The prior dynamic programming algorithm cannot work in this case, as the algorithm relies on the set of entries $C[v, \cdot]$ fully describing all information necessary to optimally use the EinSum expression corresponding to vertex $v$. However, data in machine learning computations (both input and intermediate) are often used multiple times. For example, during backpropagation, intermediate data are used during both the forward and backward passes. In both the graph convolutional neural network and the attention computation in our experiments, input data are used in two EinSum expressions. If $v$ is used more than once, an optimal algorithm would need to consider both uses of $v$ and co-optimize them.

The approach we implement is to decompose an EinSum program into a set of "trees". That is, we first choose a subgraph of program $G$ where no EinSum result is used more than once (in general, this subgraph should be "maximal" in the sense that if we can add another vertex and it is still the case that no vertex is used more than once, we should do so). Then, we optimize this subgraph using Algorithm 1. At this point, the tensor-relational decomposition of each tensor that has been optimized is treated as being fixed.

Dataset                     Classes  Features  Nodes    Edges
Cora                        7        1,433     2,485    5,069
Planetoid.Cora [43]         7        1,433     2,708    10,556
Planetoid.CiteSeer [43]     6        3,703     3,327    9,104
CitationFull.CiteSeer [7]   6        602       4,230    10,674
Amazon.Photo [34]           8        745       7,650    0.2M
Amazon.Computers [34]       10       767       13,752   0.4M
ogbn-arxiv* [21]            40       128       169,343  1.1M
ogbn-products* [21]         47       100       2.4M     61.8M
ogbn-papers100M* [21]       172      128       111M     1.6B
friendster* [9]             100      128       65M      1.8B

Table 1: Statistics of the datasets after graph standardization. Datasets marked with * are used for distributed execution.
This process is then repeated: another subgraph of program $G$ where no EinSum result is used more than once is chosen and optimized, until all decompositions have been chosen. This will not arrive at a globally optimal solution, and it could be the case that some tensor-relational decompositions that are used twice are not chosen so as to be globally optimal. However, as the penalty is only a sub-optimal, additional repartition, the expected cost of this non-optimality will hopefully be inconsequential.

Memory constraints. Often, in terms of performance, the optimal schema design is to use only massive tensors, and not to decompose the tensors at all. In practice, this will lead to memory problems. In our implementation, we address this by constraining the search to reflect memory limitations—maximum sizes for the total bytes required for the inputs and output of a kernel call, for example.

Non-binary EinSum expressions. All of the algorithms and cost models described thus far can easily be extended to unary EinSum expressions, and we do not consider this issue further. Higher-degree EinSum expressions are more challenging, since the contraction path is critical for performance. While rare in ML computations, they do occur, particularly in other application domains. Our suggested tactic is to use optimization libraries such as opt_einsum [12] to compute the contraction path and then break such expressions into a series of binary EinSum expressions. More sophisticated approaches are a topic for future work.

9 EXPERIMENTS

9.1 Experimental Overview

Experimentally, we wish to answer the question: can upper-case-lower-case Einstein notation generated by SparseEinSum show better scalability than traditional tensor or purely relational approaches?
Dataset          Machines  DGL    AliGraph  Opt    2nd    3rd
ogbn-arxiv       1         4.3    12.9      8.3    13.4   21.8
                 2         1.6    6.9       5.2    8.8    12.6
                 4         0.9    3.4       2.0    4.1    6.3
                 8         0.6    2.0       1.2    3.6    5.7
ogbn-products    1         20.3   96.7      31.4   35.7   37.2
                 2         17.9   51.1      16.5   22.6   28.1
                 4         9.6    31.7      8.4    11.8   15.9
                 8         6.3    18.8      5.9    8.6    11.2
ogbn-papers100M  1         OOM    OOM       289.5  306.1  351.9
                 2         OOM    OOM       171.4  168.4  207.4
                 4         92.5   OOM       82.1   93.5   95.1
                 8         59.3   OOM       42.4   52.0   61.1
friendster       1         OOM    OOM       473.9  686.0  721.1
                 2         OOM    OOM       324.0  413.1  556.1
                 4         OOM    OOM       156.9  277.6  381.3
                 8         103.1  OOM       94.5   186.4  206.1

Table 2: Running time comparison (in seconds) for one iteration of GCN training across different cluster sizes. DGL (PyTorch) and AliGraph are tensor-based systems, compared with the top three SparseEinSum decompositions (Opt, 2nd, 3rd).

9.2 Evaluating SparseEinSum Decompositions

Distributed graph convolutional neural networks (GCNs) [26]. We compare SparseEinSum-generated tensor-relational computations with DGL v2.4.0¹ [39] running on a PyTorch backend [30]—this is a fully tensor-based system—as well as AliGraph [51]. These systems are chosen as they are among the most well-known systems for graph neural network computation. The distributed engine used to run the SparseEinSum-generated tensor-relational computation is PlinyCompute² [52], a distributed relational computation engine. We use TACO³ compiler-generated kernels [11, 27, 28] to efficiently support tensor operations of arbitrary shape within the tensor-relational computations generated by SparseEinSum.

Computational problem. The core computation in a graph convolutional neural network is given by the following EinSum:

$$\mathbf{T}^0_{i,k} \leftarrow \sum_j \hat{\mathbf{D}}_{i,j} \times \hat{\mathbf{A}}_{j,k}; \qquad \mathbf{T}^1_{i,l} \leftarrow \sum_k \mathbf{T}^0_{i,k} \times \hat{\mathbf{D}}_{k,l}$$
$$\mathbf{T}^2_{i,m} \leftarrow \sum_l \mathbf{T}^1_{i,l} \times \mathbf{H}^{(l)}_{l,m}; \qquad \mathbf{T}^3_{i,n} \leftarrow \sum_m \mathbf{T}^2_{i,m} \times \mathbf{W}^{(l)}_{m,n}$$
$$\mathbf{H}^{(l+1)}_{i,n} \leftarrow \sum_\emptyset \sigma(\mathbf{T}^3_{i,n})$$

Here, $\hat{\mathbf{A}}$ is the adjacency matrix with self-loops added, $\hat{\mathbf{D}}$ is the degree matrix, $\mathbf{H}^{(l)}$ is the node embedding matrix at layer $l$, and $\mathbf{W}^{(l)}$ is the weight matrix at layer $l$.
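For intuition, the chain of binary contractions can be checked densely with numpy.einsum on a toy graph ($\sigma$ taken to be ReLU here, one common choice; at scale, this dense execution is exactly what SparseEinSum avoids):

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, h = 5, 4, 3                 # nodes, input features, hidden size

# Random adjacency with self-loops; D_hat holds the D^{-1/2} normalizer.
A_hat = (rng.random((n, n)) < 0.4).astype(float) + np.eye(n)
D_hat = np.diag(A_hat.sum(axis=1) ** -0.5)
H = rng.standard_normal((n, f))   # layer-l node embeddings
W = rng.standard_normal((f, h))   # layer-l weights

# The binary EinSum expressions from the text, in order.
T0 = np.einsum("ij,jk->ik", D_hat, A_hat)   # D^{-1/2} A
T1 = np.einsum("ik,kl->il", T0, D_hat)      # ... D^{-1/2}
T2 = np.einsum("il,lm->im", T1, H)          # aggregate neighbor features
T3 = np.einsum("im,mn->in", T2, W)          # learned linear transformation
H_next = np.maximum(T3, 0.0)                # sigma = ReLU

# Same result as the fused formula sigma(D^{-1/2} A D^{-1/2} H W).
ref = np.maximum(D_hat @ A_hat @ D_hat @ H @ W, 0.0)
assert np.allclose(H_next, ref)
print(H_next.shape)  # (5, 3)
```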
This expression computes the node embeddings for the next layer by: (1) normalizing the adjacency matrix using the degree matrices ($\hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}}$), (2) aggregating neighbor features through this normalized adjacency matrix, and (3) applying a learned transformation through $\mathbf{W}^{(l)}$ followed by a non-linear activation function $\sigma$.

¹ 2.4.x branch until commit c6c874b
² https://github.com/riceplinygroup/plinycompute
³ https://github.com/tensor-compiler/taco

Figure 1: Example of quantum circuits partitioned into multiple layers from 1 to 12.

Dataset          Qubits  Depth  Width  Gate Count
seca_n11         11      37     11     1275
multiplier_n13   13      7      13     496

Table 3: Dataset statistics for quantum benchmarks.

Circuit          Machines  Opt      2nd Best  3rd Best
seca_n11         1         36.124   26.7      32.013
                 2         21.214   27.512    27.752
                 4         13.817   14.347    15.424
                 8         10.094   12.531    12.309
multiplier_n13   1         102.632  105.316   116.229
                 2         55.247   64.731    71.315
                 4         36.651   42.325    47.426
                 8         21.922   25.127    30.98

Table 4: Time (secs) vs. machines for quantum circuits.

We implement a two-layer GCN in the transductive setting with the following hyperparameters: Adam optimizer with learning rate $\eta = 0.1$, dropout rate $\gamma = 0.5$, and hidden layer dimension $D = 256$. All distributed experiments were conducted on servers equipped with Intel Xeon Silver 4214 CPUs, 128 GB of DDR4 memory, and 1 TB of SSD storage. The nodes are interconnected via 10 Gbps Ethernet. Experiments were performed using clusters of 1 to 8 machines. Statistics describing the data sets are given in Table 1.

Results and discussion. For large graphs that cannot fit into the memory of a single machine, we use ParMETIS [1] to assist with partitioning when running on DGL and AliGraph. The execution times are summarized in Table 2. To evaluate how accurate the cost model used by the SparseEinSum re-write algorithm is, we execute the best upper-case-lower-case decomposition (Opt), as well as the second-best (2nd) and third-best (3rd). For the smallest graph (ogbn-arxiv), DGL (running on PyTorch) is fastest.
This is not surprising, as PlinyCompute is built for scalability, not low latency. For the next larger graph (ogbn-products), the two approaches are about the same. For the largest two graphs (1B+ edges), the SparseEinSum-generated computation has obvious advantages. It is the only one that can run in all cases—DGL and AliGraph are both plagued by out-of-memory problems, despite manual tuning. In the cases where DGL can run, it is slower than the SparseEinSum-generated computation by 8.6% to nearly 40%. Note that the tensor-relational implementation scales well across different numbers of machines. SparseEinSum-on-PlinyCompute shows a speed-up of 5.3X, 6.8X, and 5.0X going from one to eight machines on ogbn-products, ogbn-papers100M, and friendster, respectively. DGL-on-PyTorch fails on the two larger graphs, and has a speedup of only 3.2X on ogbn-products.

Method    Cora   P.Cora  P.CiteSeer  C.CiteSeer  A.Photo  A.Comp
Sparse Attention Computation
Hyper     866    901     1497        402         4167     801
PstGSQL   8017   7990    12156       6191        8429     OOD
SQLite    22052  22084   TLE         34875       TLE      OOD
SEinSum   4.5    4.7     8.0         7.7         192      265
Dense Attention Computation
Hyper     2194   2185    3457        752         4868     123
PstGSQL   8130   8186    12432       6361        8464     OOD
SQLite    16424  15180   TLE         39470       TLE      TLE
SEinSum   609    602     950         54          OOD      OOD

Table 5: Running time (in seconds) for attention computations. P stands for Planetoid, C for CitationFull, A for Amazon. OOD is Out of Disk, as the database wrote too much data.

Method    Cora  P.Cora  P.CiteSeer  C.CiteSeer  A.Photo  A.Comp
Graph Convolutional Neural Network
Hyper     10.7  10.8    21.02       14.4        197.5    390.4
PstGSQL   16.0  16.0    30.9        7.7         436.1    830.3
SQLite    TLE   TLE     TLE         TLE         TLE      TLE
SEinSum   7.3   8.4     22.9        3.3         19.3     37.6

Table 6: Graph convolutional neural network running times (in seconds). TLE indicates Time Limit Exceeded.

Single-machine convolutional neural networks. Single-machine experiments focus on comparing SparseEinSum-generated tensor-relational implementations with equivalent, purely-relational implementations.
These are generated using the PortableSQL EinSum compiler [5], chosen as it is the standard method in the database literature for converting EinSum to SQL. SparseEinSum-generated tensor-relational computations are run on top of PostgreSQL, again using TACO-generated kernels. The purely relational implementations are run on SQLite [3], PostgreSQL [15], and Hyper [24].

Results and discussion. Running times are shown in Table 6. For the four smallest data sets, there is not an appreciable difference between options: the tensor-relational implementation is generally fastest, but it cuts at most 25% off the running time of "pure relational + Hyper". For the larger two data sets, the tensor-relational implementation is ten times faster than "pure relational + Hyper", and more than 40 times faster than "pure relational + Postgres".

In Table 8, we show the schema generated by the SparseEinSum compiler for two of the input graphs. Interestingly, the computations are both implemented quite relationally. However, as one might expect, the exception is that the features and embeddings (which are dense) are implemented as vectors on a per-vertex basis, and the linear transformations are implemented as dense matrices.

Distributed quantum circuit simulation [47]. We adopt the widely used QASMBench [29] suite as the quantum circuit simulation benchmark. Qiskit 1.2.4 [40] and cuda-quantum 24.03.0 [25] are used to help with quantum circuit generation and execution.
UCLC EinSum                           Planetoid.CiteSeer
T0_{I,k} ← Σ_M X_{I,M} × WQ_{M,k}     X(INT i, INT m, DOUBLE val); WQ(INT m, TENSOR[1024] val)
T1_{J,k} ← Σ_N X_{J,N} × WK_{N,k}     X(INT j, INT n, DOUBLE val); WK(INT n, TENSOR[1024] val)
T2_{I,J,k} ← Σ_∅ T0_{I,k} × A_{I,J}   T0(INT i, TENSOR[1024] val); A(INT i, INT j, DOUBLE val)
T3_{I,J} ← Σ_k T2_{I,J,k} × T1_{J,k}  T2(INT i, INT j, TENSOR[1024] val); T1(INT j, TENSOR[1024] val)
Attn_{I,J} ← Σ_∅ (1/√d_k)(T3_{I,J})   T3(INT i, INT j, DOUBLE val)

Table 7: Upper-case-lower-case Einstein notation produced by SparseEinSum for Planetoid.CiteSeer.

We select seca_n11 and multiplier_n13 as large-scale benchmarks. Characteristics of these benchmarks are shown in Table 3.

Computational problem. The tensor contraction for quantum gate operations can be written as:

$$\psi'_{i_n} = \sum_{i_{n-1}} \mathbf{U}^{(n)}_{i_n, i_{n-1}} \sum_{i_{n-2}} \mathbf{U}^{(n-1)}_{i_{n-1}, i_{n-2}} \cdots \sum_{i_1} \mathbf{U}^{(2)}_{i_2, i_1} \sum_{i_0} \mathbf{U}^{(1)}_{i_1, i_0}\, \psi_{i_0} \tag{8}$$

where $\mathbf{U}^{(k)}$ represents the unitary gate operation matrix at layer $k$, $\psi$ is the initial quantum state vector, and $\psi'$ is the final quantum state vector after applying all gate operations. Each quantum circuit is partitioned into multiple layers, where multiple gates can be executed in parallel within a layer. For each layer, we generate one unitary matrix by tensoring all gates within the layer. The whole simulation process passes the qubit state vector through the unitary matrix for each layer, one layer at a time, as shown in Figure 1.

Results and discussion. The results for running the large-scale benchmarks in a distributed environment are shown in Table 4. We evaluate the best, second-best, and third-best decompositions produced, according to the cost model used by the SparseEinSum compiler. In almost all cases, the cost model has ordered the various decompositions correctly. Scaling efficiency is reasonable: eight machines are 3.6X faster than one on seca_n11, and 4.6X faster on multiplier_n13.
Scaling efficiency is worse than in the case of the GCN, as the data movement overhead is high and the computation cost is relatively small.

Single-machine attention computation [45, 48]. Sparse attention resembles classic dense attention, but it allows an arbitrary graph to indicate which items should attend to which items.

Computational problem. The core computation in a graph transformer is the sparse attention computation:

$$\mathbf{T}^0_{i,k} \leftarrow \sum_m \mathbf{X}_{i,m} \times \mathbf{W}^Q_{m,k}; \qquad \mathbf{T}^1_{j,k} \leftarrow \sum_n \mathbf{X}_{j,n} \times \mathbf{W}^K_{n,k}$$
$$\mathbf{T}^2_{i,j,k} \leftarrow \sum_\emptyset \mathbf{T}^0_{i,k} \times \mathbf{A}_{i,j}; \qquad \mathbf{T}^3_{i,j} \leftarrow \sum_k \mathbf{T}^2_{i,j,k} \times \mathbf{T}^1_{j,k}$$
$$\mathbf{Attn}_{i,j} \leftarrow \sum_\emptyset \frac{1}{\sqrt{d_k}}\,(\mathbf{T}^3_{i,j})$$

where $\mathbf{X}$ stores node features, $\mathbf{W}^Q$ and $\mathbf{W}^K$ are learnable query and key projection matrices respectively, $\mathbf{A}$ is the adjacency matrix indicating the graph structure, $d_k$ is the dimension of the key vectors, and $\mathbf{Attn}_{i,j}$ gives the attention scores before normalization. Following prior work [45, 48], we evaluate both dense attention (where each node attends to all other nodes) and sparse attention (where nodes only attend to their local neighborhood in the graph).

UCLC EinSum                           CitationFull.CiteSeer                                          Amazon.Computers
T0_{I,K} ← Σ_J D̂_{I,J} × Â_{J,K}      D̂(INT i, INT j, DOUBLE val); Â(INT j, INT k, DOUBLE val)      D̂(INT i, INT j, DOUBLE val); Â(INT j, INT k, DOUBLE val)
T1_{I,L} ← Σ_K T0_{I,K} × D̂_{K,L}     T0(INT i, INT k, DOUBLE val); D̂(INT k, INT l, DOUBLE val)     T0(INT i, INT k, DOUBLE val); D̂(INT k, INT l, DOUBLE val)
T2_{I,m} ← Σ_L T1_{I,L} × H(l)_{L,m}  T1(INT i, INT l, DOUBLE val); H(l)(INT l, TENSOR[602] val)    T1(INT i, INT l, DOUBLE val); H(l)(INT l, TENSOR[767] val)
T3_{I,n} ← Σ_m T2_{I,m} × W(l)_{m,n}  T2(INT i, TENSOR[602] val); W(l)(TENSOR[602][256] val)        T2(INT i, TENSOR[767] val); W(l)(TENSOR[767][256] val)
H(l+1)_{I,n} ← Σ_∅ σ(T3_{I,n})        T3(INT i, TENSOR[256] val)                                    T3(INT i, TENSOR[256] val)

Table 8: Upper-case-lower-case Einstein notation produced by the SparseEinSum compiler for the GCN computation, along with the corresponding relational schemas, for two of the test datasets.

Method    Cora  P.Cora  P.CiteSeer  C.CiteSeer  A.Photo  A.Comp
Sparse Attention Computation
SEinSum   4.5   4.7     8.0         7.7         192      265
Greedy    21    23      52          17          769      884
Dense Attention Computation
SEinSum   609   602     950         54          OOD      OOD
Greedy    2198  2124    4107        315         OOD      OOD

Table 9: Comparing pure greedy runtime with the full dynamic programming solution for the attention computation.

Results and discussion. Running time results are given in Table 5. The surprising finding here is that even though relational computations are inherently sparse, the pure relational implementations are not much faster when using sparse attention, compared to dense attention. "Pure relational + Hyper" best utilizes the sparsity, and is up to 2X as fast in the sparse case compared to the dense case. In comparison, SparseEinSum can be more than 100X as fast when computing sparse attention rather than dense attention, and is 30X to 100X as fast as "pure relational + Hyper" in the sparse case.

Table 7 shows the generated schema for one of the data sets. The node embeddings are stored as vectors, and the weight matrices are also stored as per-row vectors, due to memory limitations. The input data itself is sparse and has been decomposed into purely relational tuples.

9.3 Ablations and Other Experiments

Effectiveness of the dynamic programming. It is reasonable to ask: is it necessary to run an expensive and complicated dynamic programming search during the SparseEinSum decomposition? We evaluate a simpler, greedy search: along a path through the graph of EinSum expressions, greedily pick the decomposition that minimizes the cost; do not remember decompositions with different output tensor formats that are suboptimal.
We compare greedy to SparseEinSum in Table 9, for attention on one machine.

Component  Cora  P.Cora  P.CiteSeer  C.CiteSeer  A.Photo  A.Comp
Sparse Attention Computation
Exec.      4.5   4.7     8.0         7.7         192      265
Schema     25    39      64          288         634      4239
SQL        46    11      32          2.0         5.4      3.4
Kernels    0.04  0.04    0.04        0.08        0.04     0.04
Dense Attention Computation
Exec.      609   602     950         54          OOD      OOD
Schema     146   127     533         221         862      1690
SQL        51    24      58          44          130      401
Kernels    0.06  0.06    0.06        0.06        0.06     0.06

Table 10: Running time for the various system components, attention computation, in seconds. "Exec." is the actual execution time, "Schema" is the time to pre-process the data and generate the schema, "SQL" is the time to generate the SQL including the data loading code, and "Kernels" is the time to generate the kernels using TACO.

Perturbation   None (as-is)  C_xfer × 1/10  C_xfer × 10  C_krnel × 1/10  C_krnel × 10  C_fixed × 1/10  C_fixed × 10
Running time   192           130            253          294             1027          1027            305

Table 11: Effect of perturbing the various cost parameters (C_xfer, C_krnel, C_fixed) on the running time for the sparse attention computation on the Amazon.Photo data set.

Robustness to cost model errors. Is an accurate cost model necessary? We perturb the cost estimates by multiplying each cost estimate by a Gamma(α, θ) random variate, where α × θ = 1 so that, in expectation, the cost value is unchanged, but errors are possible. As we decrease α and increase θ, the variance of the estimate (and hence the potential for error) increases. Results on the single-machine graph neural network computation are given in Table 12. Running time is averaged over ten different perturbed runs. Here we see that with α = θ = 1, there is no difference in the running time—as this is a significant level of noise, it suggests some robustness. The higher perturbation levels lead to slower times. Interestingly, α = 1/3, θ = 3 is worse than α = 1/10, θ = 10.

Error        Cora  P.Cora  P.CiteSeer  C.CiteSeer  A.Photo  A.Comp
Graph Convolutional Neural Network
None         7.3   8.4     22.9        3.3         19.3     37.6
Γ(1, 1)      7.2   8.3     21.5        3.4         19.5     37.1
Γ(1/3, 3)    16.1  16.0    31.9        8.0         435.9    839.9
Γ(1/10, 10)  12.0  11.2    25.1        7.4         68.0     47.4

Table 12: Graph neural network running times (in seconds) with erroneous/noisy cardinality estimates. Here, Γ(α, θ) means each estimate is multiplied by a random number generated using a Gamma(α, θ) distribution.

Robustness to cost parameters. We hold two of the three cost parameters (C_xfer, C_krnel, C_fixed) constant and perturb the third, either cutting it by a factor of ten or multiplying it by a factor of ten. Single-machine results are given in Table 11. A poor choice of cost parameter can be quite damaging. Interestingly, cutting C_xfer by a factor of ten actually reduced the running time of the resulting decomposition, suggesting that the C_xfer we chose was not optimal.

Running time breakdown. Table 10 compares the execution time for the actual attention computation with the time taken by the decomposition itself. In addition to the execution time, three other times are given: (1) Schema: the time to process the input data, compute the necessary statistics, and run the dynamic programming. (2) SQL: the time to generate the SQL, which is dominated by the time necessary to generate the INSERT INTO statements that load all of the data into PostgreSQL. (3) Kernels: the time to generate the TACO kernels. We see that these times can be quite significant, though our code is written in Python, and for the single-machine attention experiments all data are stored in text files. Further, to ensure portability across engines, all data loading is handled using SQL, and is verbose. During deployment or training, the attention computation would be run hundreds of times or more, so these times would be amortized across runs.

10 RELATED WORK

EinSum notation was proposed by Einstein [16], and is now supported by machine learning/AI frameworks such as NumPy [20], JAX [18], PyTorch [30], and Dask [13]. However, upper-case-lower-case EinSum, where EinSum is modified to facilitate sharding specification, is unique.
A general problem in this space is how to specify sharding (see https://jax-ml.github.io/scaling-book/sharding/ for an example of such an effort), but upper-case-lower-case EinSum is unique in its ability to combine a specification of the computation with the way in which the computation is decomposed. We propose a dynamic programming algorithm for the problem of rewriting a DAG of EinSum expressions (where paths between any connected vertices are unique) into a DAG of upper-case-lower-case EinSum. Dynamic programming algorithms of this variety have been around for a long time; our algorithm is strongly reminiscent of Felsenstein's algorithm [17] from computational biology, and other efforts in sharding compute graphs for machine learning have also used dynamic programming [8, 49]. Likewise, there exist cost models for sparse tensor computations [2, 35]. There are a number of kernel compilers for tensor computations, such as TVM [10], Triton [38], and TACO [11, 27, 28]. Both TACO and TVM (via SparseTIR [44]) provide explicit support for sparsity. Of the various tensor compilers with support for sparsity, Galley [14] is closest to our own proposal, using many ideas from relational system design (such as applying relational methods for sparsity estimation; see Section 7 of this paper). This existing work targets the generation of CPU or GPU kernels or programs, not scalable relational computations that are meant to run on top of a relational system. The core idea in the SparseEinSum rewrite approach is to determine which parts of the computation should be run relationally and which parts should utilize high-performance CPU or GPU kernels generated by these other systems. In that sense, SparseEinSum sits on top of these tools and is a consumer of the kernels that they produce. In fact, our SparseEinSum prototype uses TACO to generate dense kernels.
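To make the Felsenstein-style recursion concrete, here is a toy sketch; this is our illustration, not the paper's actual SparseEinSum algorithm, and the node names, the two candidate materializations ("dense" vs. "relational"), and all cost numbers are invented. Because paths between connected vertices are unique, each child's optimal cost table can be computed independently and combined bottom-up, exactly as in Felsenstein's pruning algorithm.

```python
# Toy expression tree: leaves are input tensors, internal nodes are ops.
# (name, children) pairs; names and costs are invented for illustration.
tree = ("out", [("t1", [("A", []), ("B", [])]), ("C", [])])

FORMATS = ("dense", "relational")

# Invented execution cost of producing each node in each format.
exec_cost = {
    ("A", "dense"): 1, ("A", "relational"): 1,
    ("B", "dense"): 1, ("B", "relational"): 1,
    ("C", "dense"): 1, ("C", "relational"): 1,
    ("t1", "dense"): 10, ("t1", "relational"): 4,
    ("out", "dense"): 3, ("out", "relational"): 8,
}
TRANSFER = 2  # invented cost of converting formats along an edge

def best(node):
    """Map each format to the cheapest cost of producing `node` in it."""
    name, children = node
    child_tables = [best(c) for c in children]
    table = {}
    for fmt in FORMATS:
        total = exec_cost[(name, fmt)]
        for ct in child_tables:
            # A child may be produced in either format; pay a transfer
            # cost if its format differs from this node's.
            total += min(c + (TRANSFER if f != fmt else 0)
                         for f, c in ct.items())
        table[fmt] = total
    return table

costs = best(tree)            # {'dense': 12, 'relational': 15}
plan_cost = min(costs.values())
```

A real system would replace the two toy formats with its menu of tensor-relational decompositions, and the invented constants with a cost model (transfer, kernel, and fixed costs) driven by sparsity estimates.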
There has been other recent work that explores the close relationship between tensors and relations in order to optimize tensor computations. Schleich et al. propose STOREL [31], which accepts a user-supplied storage format for each input tensor, and then optimizes the tensor computation. In contrast, SparseEinSum accepts a declaratively-specified computation and then optimizes the representation format, choosing how to decompose the tensors into relations so as to optimize performance. The SparseEinSum approach relies on the underlying relational system to optimize the computation. In The Duck's Brain [32], Schüle et al. are the most recent to argue for the use of relational systems augmented with tensors for modern ML computations, using precisely the sort of tensor-relational representations described in this paper. The Duck's Brain is an example of a relational platform on which the output of a SparseEinSum-generated tensor-relational computation could be run. The idea of combined tensor and relational processing has been explored in other works [22, 23, 37, 46], including SystemML [19] and SystemDS [6], which extended SQL-style operations with matrix primitives. However, these prior works do not consider how to automatically generate optimized tensor-relational representations. As in the case of The Duck's Brain, SparseEinSum could be used to generate the decomposition used by these systems. Another approach explored in the literature is the use of pure relational implementations of ML or AI computations (see, for example, Pytond [33] or Portable SQL [5] for two recent works). Another recent work proposes an algorithm to switch from dense to sparse representations dynamically at runtime [36].

11 CONCLUSIONS

We introduced an algorithm that decomposes tensor computation into equivalent tensor-relational computation, as well as SparseEinSum, a system that takes EinSum computation as input, and compiles that computation into upper-case-lower-case Einstein notation.
This computation can be implemented on virtually any database system with multidimensional array support. The compilation process incorporates sparsity estimation techniques and automatic cost-based schema optimization, enabling the generated tensor-relational computations to achieve significant performance improvements.

ACKNOWLEDGMENTS

This research was supported by NSF grants 1918651, 2008240, 2131294, and 2212557, NIH CTSA #UM1TR004906, and US DOT Tier-1 UTC CYBER-CARE grant #69A3552348332.

REFERENCES

[1] [n.d.]. ParMETIS. https://github.com/KarypisLab/ParMETIS.
[2] Willow Ahrens, Fredrik Kjolstad, and Saman Amarasinghe. 2022. Autoscheduling for sparse tensor algebra with an asymptotic cost model. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 269–285.
[3] Grant Allen and Mike Owens. 2010. Introducing SQLite. The Definitive Guide to SQLite (2010), 1–16.
[4] Alan H Barr. 1991. The Einstein summation notation. An Introduction to Physically Based Modeling (Course Notes 19), pages E 1 (1991), 57.
[5] Mark Blacher, Julien Klaus, Christoph Staudt, Sören Laue, Viktor Leis, and Joachim Giesen. 2023. Efficient and Portable Einstein Summation in SQL. Proc. ACM Manag. Data 1, 2, Article 121 (June 2023), 19 pages. https://doi.org/10.1145/3589266
[6] Matthias Boehm, Iulian Antonov, Sebastian Baunsgaard, Mark Dokter, Robert Ginthör, Kevin Innerebner, Florijan Klezin, Stefanie Lindstaedt, Arnab Phani, Benjamin Rath, et al. 2019. SystemDS: A declarative machine learning system for the end-to-end data science lifecycle. arXiv preprint arXiv:1909.02976 (2019).
[7] Aleksandar Bojchevski and Stephan Günnemann. 2017. Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815 (2017).
[8] Daniel Bourgeois, Zhimin Ding, Dimitrije Jankov, Jiehui Li, Mahmoud Sleem, Yuxin Tang, Jiawen Yao, Xinyu Yao, and Chris Jermaine. 2024.
EinDecomp: Decomposition of Declaratively-Specified Machine Learning and Numerical Computations for Parallel Execution. arXiv preprint arXiv:2410.02682 (2024).
[9] Danah Michele Boyd. 2004. Friendster and publicly articulated social networking. In CHI '04 Extended Abstracts on Human Factors in Computing Systems. 1279–1282.
[10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.
[11] Stephen Chou, Fredrik Kjolstad, and Saman Amarasinghe. 2018. Format abstraction for sparse tensor algebra compilers. Proc. ACM Program. Lang. 2, OOPSLA, Article 123 (Oct. 2018), 30 pages. https://doi.org/10.1145/3276493
[12] G Daniel, Johnnie Gray, et al. 2018. Opt_einsum - A Python package for optimizing contraction order for einsum-like expressions. Journal of Open Source Software 3, 26 (2018), 753.
[13] Jesse Daniel. 2019. Data Science with Python and Dask. Simon and Schuster.
[14] Kyle Deeds, Willow Ahrens, Magdalena Balazinska, and Dan Suciu. 2025. Galley: Modern Query Optimization for Sparse Tensor Programs. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–24.
[15] Korry Douglas and Susan Douglas. 2003. PostgreSQL: A Comprehensive Guide to Building, Programming, and Administering PostgreSQL Databases. Sams Publishing.
[16] Albert Einstein, Leopold Infeld, and Banesh Hoffmann. 1938. The gravitational equations and the problem of motion. Annals of Mathematics (1938), 65–100.
[17] Joseph Felsenstein. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17, 6 (1981), 368–376.
[18] Roy Frostig, Matthew James Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. Systems for Machine Learning 4, 9 (2018).
[19] Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. 2011. SystemML: Declarative machine learning on MapReduce. In 2011 IEEE 27th International Conference on Data Engineering. IEEE, 231–242.
[20] Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. 2020. Array programming with NumPy. Nature 585, 7825 (2020), 357–362.
[21] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems 33 (2020), 22118–22133.
[22] Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J Gao. 2019. Declarative recursive computation on an RDBMS, or, why you should use a database for distributed machine learning. arXiv preprint arXiv:1904.11121 (2019).
[23] Dimitrije Jankov, Binhang Yuan, Shangyu Luo, and Chris Jermaine. 2021. Distributed numerical and machine learning computations via two-phase execution of aggregated join trees. Proceedings of the VLDB Endowment 14, 7 (2021).
[24] Alfons Kemper and Thomas Neumann. 2011. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering. IEEE, 195–206.
[25] Jin-Sung Kim, Alex McCaskey, Bettina Heim, Manish Modani, Sam Stanwyck, and Timothy Costa. 2023. CUDA Quantum: The platform for integrated quantum-classical computing. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–4.
[26] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[27] Fredrik Kjolstad, Willow Ahrens, Shoaib Kamil, and Saman Amarasinghe. 2019.
Tensor algebra compilation with workspaces. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 180–192.
[28] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The tensor algebra compiler. Proc. ACM Program. Lang. 1, OOPSLA, Article 77 (Oct. 2017), 29 pages. https://doi.org/10.1145/3133901
[29] Ang Li, Samuel Stein, Sriram Krishnamoorthy, and James Ang. 2023. QASMBench: A low-level quantum benchmark suite for NISQ evaluation and simulation. ACM Transactions on Quantum Computing 4, 2 (2023), 1–26.
[30] A Paszke. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
[31] Maximilian Schleich, Amir Shaikhha, and Dan Suciu. 2023. Optimizing tensor programs on flexible storage. Proceedings of the ACM on Management of Data 1, 1 (2023), 1–27.
[32] Maximilian Schüle, Thomas Neumann, and Alfons Kemper. 2024. The Duck's Brain: Training and Inference of Neural Networks within Database Engines. Datenbank-Spektrum 24, 3 (2024), 209–221.
[33] Hesam Shahrokhi, Amirali Kaboli, Mahdi Ghorbani, and Amir Shaikhha. 2024. PyTond: Efficient Python data science on the shoulders of databases. In 2024 IEEE 40th International Conference on Data Engineering (ICDE). IEEE, 423–435.
[34] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
[35] Johanna Sommer, Matthias Boehm, Alexandre V Evfimievski, Berthold Reinwald, and Peter J Haas. 2019. MNC: Structure-exploiting sparsity estimation for matrix expressions. In Proceedings of the 2019 International Conference on Management of Data. 1607–1623.
[36] Christoph Staudt, Mark Blacher, Tim Hoffmann, Kaspar Kasche, Olaf Beyersdorff, and Joachim Giesen. 2025. Exploiting Dynamic Sparsity in Einsum. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[37] Yuxin Tang, Zhimin Ding, Dimitrije Jankov, Binhang Yuan, Daniel Bourgeois, and Chris Jermaine. 2023. Auto-differentiation of relational computations for very large scale machine learning. In International Conference on Machine Learning. PMLR, 33581–33598.
[38] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10–19.
[39] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. 2020. Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks. arXiv:1909.01315 [cs.LG] https://arxiv.org/abs/1909.01315
[40] Robert Wille, Rod Van Meter, and Yehuda Naveh. 2019. IBM's Qiskit tool chain: Working with and developing for real quantum computers. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1234–1240.
[41] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2020), 4–24.
[42] Lizhi Xiang, Omid Asudeh, Gerald Sabin, Aravind Sukumaran-Rajam, and P Sadayappan. 2025. cuTeSpMM: Accelerating Sparse-Dense Matrix Multiplication using GPU Tensor Cores. arXiv preprint arXiv:2504.06443 (2025).
[43] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning. PMLR, 40–48.
[44] Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, and Luis Ceze. 2023. SparseTIR: Composable abstractions for sparse compilation in deep learning.
In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 660–678.
[45] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems 34 (2021), 28877–28888.
[46] Binhang Yuan, Dimitrije Jankov, Jia Zou, Yuxin Tang, Daniel Bourgeois, and Chris Jermaine. 2021. Tensor relational algebra for distributed machine learning system design. Proc. VLDB Endow. 14, 8 (April 2021), 1338–1350. https://doi.org/10.14778/3457390.3457399
[47] Chen Zhang, Zeyu Song, Haojie Wang, Kaiyuan Rong, and Jidong Zhai. 2021. HyQuas: Hybrid partitioner based quantum circuit simulation system on GPU. In Proceedings of the 35th ACM International Conference on Supercomputing. 443–454.
[48] Meng Zhang, Jie Sun, Qinghao Hu, Peng Sun, Zeke Wang, Yonggang Wen, and Tianwei Zhang. 2024. TorchGT: A Holistic System for Large-Scale Graph Transformer Training. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–17.
[49] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578.
[50] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.
[51] Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou. 2019. AliGraph: A Comprehensive Graph Neural Network Platform.
arXiv:1902.08730 [cs.DC] https://arxiv.org/abs/1902.08730
[52] Jia Zou, R Matthew Barnett, Tania Lorido-Botran, Shangyu Luo, Carlos Monroy, Sourav Sikdar, Kia Teymourian, Binhang Yuan, and Chris Jermaine. 2018. PlinyCompute: A platform for high-performance, distributed, data-intensive tool development. In Proceedings of the 2018 International Conference on Management of Data. 1189–1204.