
Paper deep dive

NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics

Djamel Bouchaffra, Fayçal Ykhlef, Hanene Azzag, Mustapha Lebbah, Bilal Faye

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 76

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/22/2026, 6:08:43 AM

Summary

The NeuroGame Transformer (NGT) is a novel architecture that replaces standard pairwise attention with a framework grounded in cooperative game theory and statistical physics. It models tokens as players in a game and spins in an Ising Hamiltonian system, using Shapley values and Banzhaf indices to quantify token importance. Attention weights are derived from the Gibbs distribution via mean-field approximations, enabling higher-order semantic modeling and improved performance on natural language inference benchmarks like SNLI.

Entities (6)

NeuroGame Transformer · model-architecture · 100%
SNLI · benchmark-dataset · 99%
Banzhaf indices · game-theoretic-concept · 98%
Shapley values · game-theoretic-concept · 98%
Gibbs distribution · statistical-physics-concept · 95%
Ising Hamiltonian · statistical-physics-model · 95%

Relation Signals (4)

NeuroGame Transformer evaluated on SNLI

confidence 99% · Experimental results demonstrate that the NeuroGame Transformer achieves strong performance across SNLI

NeuroGame Transformer implements Ising Hamiltonian

confidence 95% · The system's energy follows an Ising Hamiltonian

NeuroGame Transformer utilizes Shapley values

confidence 95% · Token importance is quantified using two complementary game-theoretic concepts -- Shapley values for global, permutation-based attribution

NeuroGame Transformer utilizes Banzhaf indices

confidence 95% · and Banzhaf indices for local, coalition-level influence.

Cypher Suggestions (2)

Identify benchmarks where the NeuroGame Transformer was evaluated · confidence 95% · unvalidated

MATCH (m:Model {name: 'NeuroGame Transformer'})-[:EVALUATED_ON]->(b:Dataset) RETURN b.name

Find all game-theoretic concepts used by the NeuroGame Transformer · confidence 90% · unvalidated

MATCH (m:Model {name: 'NeuroGame Transformer'})-[:UTILIZES]->(c:Concept) RETURN c.name

Abstract

Abstract: Standard attention mechanisms in transformers are limited by their pairwise formulation, which hinders the modeling of higher-order dependencies among tokens. We introduce the NeuroGame Transformer (NGT) to overcome this by reconceptualizing attention through a dual perspective: tokens are treated simultaneously as players in a cooperative game and as interacting spins in a statistical physics system. Token importance is quantified using two complementary game-theoretic concepts: Shapley values for global, permutation-based attribution and Banzhaf indices for local, coalition-level influence. These are combined via a learnable gating parameter to form an external magnetic field, while pairwise interaction potentials capture synergistic relationships. The system's energy follows an Ising Hamiltonian, with attention weights emerging as marginal probabilities under the Gibbs distribution, efficiently computed via mean-field equations. To ensure scalability despite the exponential coalition space, we develop importance-weighted Monte Carlo estimators with Gibbs-distributed weights. This approach avoids explicit exponential factors, ensuring numerical stability for long sequences. We provide theoretical convergence guarantees and characterize the fairness-sensitivity trade-off governed by the interpolation parameter. Experimental results demonstrate that the NeuroGame Transformer achieves strong performance across SNLI and MNLI-matched, outperforming some major efficient transformer baselines. On SNLI, it attains a test accuracy of 86.4% (with a peak validation accuracy of 86.6%), surpassing ALBERT-Base and remaining highly competitive with RoBERTa-Base. Code is available at this https URL.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

75,685 characters extracted from source content.


NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics

Djamel Bouchaffra¹, Fayçal Ykhlef², Hanane Azzag³, Mustapha Lebbah¹, Bilal Faye³
{djamel.bouchaffra, mustapha.lebbah}@uvsq.fr, fykhlef@cdta.dz, {bilal.faye, azzag}@lipn.univ-paris13.fr

Abstract

Standard attention mechanisms in transformers are limited by their pairwise formulation, which hinders the modeling of higher-order dependencies among tokens. We introduce the NeuroGame Transformer (NGT) to overcome this by reconceptualizing attention through a dual perspective: tokens are treated simultaneously as players in a cooperative game and as interacting spins in a statistical physics system. Token importance is quantified using two complementary game-theoretic concepts: Shapley values for global, permutation-based attribution and Banzhaf indices for local, coalition-level influence. These are combined via a learnable gating parameter to form an external magnetic field, while pairwise interaction potentials capture synergistic relationships. The system's energy follows an Ising Hamiltonian, with attention weights emerging as marginal probabilities under the Gibbs distribution, efficiently computed via mean-field equations. To ensure scalability despite the exponential coalition space, we develop importance-weighted Monte Carlo estimators with Gibbs-distributed weights. This approach avoids explicit exponential factors, ensuring numerical stability for long sequences. We provide theoretical convergence guarantees and characterize the fairness-sensitivity trade-off governed by the interpolation parameter. Experimental results demonstrate that the NeuroGame Transformer achieves strong performance across SNLI and MNLI-matched, outperforming some major efficient transformer baselines. On SNLI, it attains a test accuracy of 86.4% (with a peak validation accuracy of 86.6%), surpassing ALBERT-Base and remaining highly competitive with RoBERTa-Base.
Code is available at https://github.com/dbouchaffra/NeuroGame-Transformer.

1 Introduction

Transformers have revolutionized natural language processing through attention mechanisms that compute context-dependent token representations. However, conventional attention relies primarily on pairwise similarity, limiting its ability to capture higher-order semantic dependencies involving multiple tokens. Recent advances explore higher-order attention [1], kernel-based reformulations [8], linear approximations [18], state-space models [16], and dynamic computation [9], developments critical for long-context domains such as genomics [28], high-resolution vision, and video understanding [14].

We propose a fundamentally different perspective bridging cooperative game theory and statistical physics to reconceptualize transformer attention [5, 4]. Instead of pairwise similarity, we treat tokens as both strategic players in a cooperative game and interacting spins in a thermodynamic system. This dual interpretation enables modeling higher-order semantic structure through token coalitions. Concretely, we introduce a semantic coalition game over token subsets: each coalition $C$ is assigned a scalar value $v(C) = -E(C)$, where $E(C)$ is a physical energy. Coherent coalitions receive lower energy and thus higher activation. Shapley values capture globally fair contributions, while Banzhaf power indices identify locally decisive tokens; we combine them into an external magnetic field that balances global and local sensitivity. Pairwise interaction potentials capture synergistic or antagonistic token relationships, and attention weights are derived from the Gibbs (Boltzmann) distribution over spin configurations, computed via mean-field equations to avoid the intractable sum over $2^n$ states. To compute Shapley values, Banzhaf indices, and interaction potentials efficiently, we employ importance-weighted Monte Carlo estimation with Gibbs-distributed importance weights.
This avoids explicit $2^n$ factors, yields unbiased estimates, and makes the partition function cancel naturally. Building on these principles, we introduce the NeuroGame Transformer, which makes the following contributions:

• We define a coalitional semantic game with an energy-based characteristic function, enabling higher-order semantic modeling grounded in statistical physics.
• We develop scalable Monte Carlo estimators for Shapley values, Banzhaf indices, and interaction potentials using Gibbs-distributed importance weights, achieving linear complexity $O(K \cdot n)$ instead of exponential $O(2^n)$, where $K$ is the number of sampled coalitions.
• We derive a NeuroGame attention mechanism combining Shapley and Banzhaf values via a learnable context-dependent parameter, with pairwise potentials capturing token interactions.
• We compute attention weights via mean-field equations, providing a principled mechanism uniting coalition-based reasoning and energy-based modeling.
• We demonstrate empirically that the NeuroGame Transformer outperforms standard and efficient transformer baselines on natural language inference benchmarks, yielding interpretable coalition-aware attention patterns.

The NeuroGame Transformer departs fundamentally from conventional attention by evaluating each token's marginal contribution across all coalitions using game-theoretic principles and deriving attention weights from the Gibbs distribution of a statistical physics system. This coalition-aware perspective, made tractable through Gibbs-based Monte Carlo estimation and mean-field approximation, enables richer semantic modeling and explicit interpretability.

2 Related Work

2.1 Standard and Efficient Attention

The transformer architecture [30] relies on pairwise dot-product attention. Efficient variants (e.g., sparse [11], linearized [18, 8]) improve scalability but preserve the pairwise similarity paradigm.
2.2 Game-Theoretic Methods in Machine Learning

Shapley values [27] and Banzhaf indices [2] are widely used for post-hoc feature attribution [23, 29] and data valuation [15]. Their integration as a core attention mechanism remains unexplored.

2.3 Statistical Physics and Energy-Based Models

The softmax operation can be viewed as a Boltzmann distribution over energies [31, 12], with temperature controlling sharpness [17]. Energy-based perspectives have informed variational formulations [20] and softmax alternatives [24].

2.4 Our Positioning

The NeuroGame Transformer uniquely synthesizes three traditionally separate directions. Unlike efficient transformers that approximate pairwise similarity, we model higher-order semantic interactions through token coalitions. Unlike game-theoretic methods that apply Shapley and Banzhaf values post-hoc, we integrate them directly into attention computation. Unlike energy-based models that define arbitrary energy functions, we ground ours in cooperative game theory via the Ising Hamiltonian. Our approach is constructive rather than descriptive: we build attention from first principles using game-theoretic valuation and statistical physics, yielding an expressive, inherently interpretable, and theoretically grounded mechanism.

3 NeuroGame Transformer: Fusion of Game Theory and Statistical Physics

3.1 Notation and Preliminaries

Consider a sequence of $n$ tokens $T = \{t_1, t_2, \ldots, t_n\}$, where each token $t_i$ is associated with a contextual embedding $x_i \in \mathbb{R}^d$. We denote the power set of all possible token coalitions by $2^T$, which contains $2^n$ elements.

Definition 1 (Token Coalition). When tokens are viewed as players in a cooperative game, a coalition is any subset $C \subseteq T$. The set of all possible coalitions is the power set $2^T$. Each coalition represents a group of tokens whose joint presence contributes collectively to the model's representation or prediction.
This contribution is quantified by a characteristic function $v : 2^T \to \mathbb{R}$, where $v(C)$ measures the semantic, predictive, or energetic contribution of the token subset $C$. For a token $t_i \in T$ and a coalition $C \subseteq T \setminus \{t_i\}$, the marginal contribution of $t_i$ to $C$ is:

$\Delta_i v(C) = v(C \cup \{t_i\}) - v(C)$. (1)

3.2 The Coalitional Semantic Game

We define a semantic characteristic function $v : 2^T \to \mathbb{R}$ that assigns to each coalition $C \subseteq T$ a scalar value representing its semantic importance within the context. This function satisfies:

1. Normalization: $v(\emptyset) = 0$
2. Monotonicity: for any $C \subseteq D \subseteq T$, $v(C) \le v(D)$
3. Boundedness: there exists $M > 0$ such that $|v(C)| \le M$ for all $C \subseteq T$

3.3 Energy-Based Semantic Characteristic Function

Let $T = \{t_1, \ldots, t_n\}$ denote the set of tokens in a sequence. At a given transformer layer, each token $t_i$ is represented by a position-aware embedding $x_i = e_i + p_i \in \mathbb{R}^d$, where $e_i$ denotes the token embedding and $p_i$ the positional encoding.

Definition 2 (Energy-Based Semantic Characteristic Function). For any coalition $C \subseteq T$, we define its semantic activation potential as:

$v(C) = f\left( \left\| \sum_{t_i \in C} W_v x_i \right\|_2 \right)$, (2)

where $W_v \in \mathbb{R}^{d \times d}$ is a learned projection matrix and $f$ is a mild nonlinearity (e.g., ReLU or tanh). We interpret $v(C)$ as a coalition-level activation or interaction strength. In accordance with statistical physics, the corresponding coalition energy is defined as $E(C) = -v(C)$, which assigns lower energy to more coherent or strongly interacting coalitions. This energy formulation naturally induces a Gibbs (Boltzmann) distribution $D_\gamma$ over token coalitions:

$D_\gamma(C) \propto \exp(-E(C)/\gamma) = \exp(v(C)/\gamma)$, (3)

where $\gamma > 0$ is a temperature parameter controlling the concentration of probability mass over coalitions.
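As a concrete illustration, here is a minimal numpy sketch of the energy-based characteristic function (Eqs. 2-3). The function name, toy dimensions, and the random stand-in for the learned projection $W_v$ are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def coalition_value(X, W_v, coalition, f=np.tanh):
    """Semantic activation potential v(C) = f(|| sum_{i in C} W_v x_i ||_2), Eq. (2)."""
    if len(coalition) == 0:
        return 0.0  # normalization axiom: v(empty set) = 0
    pooled = sum(W_v @ X[i] for i in coalition)  # sum of value-projected embeddings
    return float(f(np.linalg.norm(pooled)))

# toy example: n = 4 tokens with d = 8 position-aware embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_v = rng.normal(size=(8, 8)) * 0.1  # random stand-in for the learned W_v

v_C = coalition_value(X, W_v, [0, 2])
# Gibbs weight of the coalition (Eq. 3), up to the partition function
gamma = 1.0
gibbs_weight = np.exp(v_C / gamma)
```

With $f = \tanh$ applied to a nonnegative norm, $v(C)$ lies in $[0, 1)$, so the boundedness axiom holds with $M = 1$.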
3.4 Global and Local Token Contributions

We define in this section two semantic measures: the global contribution conveyed by the Shapley value, and the local semantic gain variation conveyed by the Banzhaf power index.

3.4.1 Shapley Value: Global Semantic Contribution

The Shapley value measures the average marginal semantic contribution of a token over all possible permutations of token arrivals, ensuring fairness, symmetry, and permutation invariance. Let $\Pi(T)$ denote the set of all $n!$ permutations of the token set $T$. For token $t_i$, let $P_i(\pi)$ be the set of tokens that appear before $t_i$ in permutation $\pi \in \Pi(T)$. The Shapley value of $t_i$ is then defined as:

$\phi_i(v) = \frac{1}{n!} \sum_{\pi \in \Pi(T)} \left[ v(P_i(\pi) \cup \{t_i\}) - v(P_i(\pi)) \right]$. (4)

Remark 1. Since equation 4 involves an expectation over the factorial-sized permutation space $\Pi(T)$, exact evaluation of the Shapley value is computationally infeasible, motivating the use of importance-weighted Monte Carlo estimation.

In the case $\sum_{j=1}^n \phi_j(v) \neq 0$, the normalized Shapley value is expressed as $\phi_i^{norm}(v) = \phi_i(v) / \sum_{j=1}^n \phi_j(v)$.

3.4.2 Banzhaf Power Index: Local Semantic Gain Variation

The Banzhaf power index evaluates the sensitivity of semantic gain to the presence or absence of a token across coalitions, treating all coalitions as equally likely. Unlike the Shapley value, it does not consider permutations of token order and emphasizes local variations in contribution. The raw Banzhaf index of token $t_i$ is defined as:

$\beta_i(v) = \frac{1}{2^{n-1}} \sum_{C \subseteq T \setminus \{t_i\}} \left[ v(C \cup \{t_i\}) - v(C) \right]$. (5)

Remark 2. Likewise, since equation 5 involves a summation over all $2^{n-1}$ coalitions, exact computation of the Banzhaf power index is computationally intractable, motivating the use of importance-weighted Monte Carlo estimation.
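For intuition, both indices can be computed exactly for a tiny game by brute-force enumeration (Eqs. 4-5). The function names and the toy additive game below are illustrative assumptions; this is feasible only for very small $n$:

```python
import itertools
import math

def shapley_exact(n, v):
    """Exact Shapley values (Eq. 4) by enumerating all n! permutations."""
    phi = [0.0] * n
    for perm in itertools.permutations(range(n)):
        before = set()
        for i in perm:
            phi[i] += v(before | {i}) - v(before)  # marginal contribution of i
            before.add(i)
    return [p / math.factorial(n) for p in phi]

def banzhaf_exact(n, v):
    """Exact Banzhaf indices (Eq. 5) by enumerating all 2^(n-1) coalitions per token."""
    beta = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for C in itertools.combinations(others, r):
                beta[i] += v(set(C) | {i}) - v(set(C))
        beta[i] /= 2 ** (n - 1)
    return beta

# additive toy game: v(C) = sum of per-token weights, so both indices recover the weights
w = [0.5, 1.0, 2.0]
v = lambda C: sum(w[i] for i in C)
print(shapley_exact(3, v))  # [0.5, 1.0, 2.0]
print(banzhaf_exact(3, v))  # [0.5, 1.0, 2.0]
```

For an additive game every marginal contribution of token $i$ equals its weight, so Shapley and Banzhaf coincide; they differ once coalitions interact, which is exactly the higher-order structure the paper targets.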
In the case $\sum_{j=1}^n \beta_j(v) \neq 0$, the normalized form is expressed as $\beta_i^{norm}(v) = \beta_i(v) / \sum_{j=1}^n \beta_j(v)$.

3.5 Token-Level Importance Score

We define, in the context of the NeuroGame Transformer, a token-level importance score derived from coalitional game-theoretic measures, and subsequently map this importance to an external field of a Gibbs energy function.

Definition 3 (Token Importance as External Field). The importance of token $t_i$, denoted by $J_i$, is defined as a context-aware convex combination of the normalized Shapley value and the normalized Banzhaf power index:

$J_i = \lambda_i \, \phi_i^{norm}(v) + (1 - \lambda_i) \, \beta_i^{norm}(v)$, (6)

where

$\lambda_i = \sigma(w^T x_i + b)$. (7)

The parameter $\lambda_i$ depends on the token representation $x_i \in \mathbb{R}^d$, the weight vector $w$, the bias $b \in \mathbb{R}$, and the sigmoid function $\sigma$, which maps the score to $(0, 1)$. The model learns when marginal contribution matters ($\lambda_i \approx 1$: Shapley-dominant), when independent swing power matters ($\lambda_i \approx 0$: Banzhaf-dominant), or a balanced regime ($\lambda_i \approx 0.5$). The quantity $J_i$ measures the contribution and relevance of token $t_i$ within the cooperative game defined by $v(C)$, with larger values indicating higher importance.

3.6 Token Pairwise Interaction

Likewise, we define a pairwise interaction between tokens representing the synergistic or antagonistic effect between two tokens when they are part of the same coalition. For any two distinct tokens $t_i, t_j \in T$, their interaction potential measures the synergistic or antagonistic effect when they appear together in a coalition, beyond what their individual contributions would predict. For a given context coalition $C \subseteq T \setminus \{t_i, t_j\}$ that contains neither token, the conditional interaction is:

$\Delta_{ij}(C) = v(C \cup \{t_i, t_j\}) - v(C \cup \{t_i\}) - v(C \cup \{t_j\}) + v(C)$. (8)

Definition 4 (Token Pairwise Interaction).
The overall interaction potential $J_{ij}$ is then defined as the expected value of $\Delta_{ij}(C)$ over all possible contexts $C \subseteq T \setminus \{t_i, t_j\}$, weighted appropriately:

$J_{ij} = \sum_{C \subseteq T \setminus \{t_i, t_j\}} w(C) \, \Delta_{ij}(C)$, (9)

where $w(C)$ is a uniform weighting scheme.

3.7 Ising Model-Based Total Energy of the Token System

We now provide the total energy (or Hamiltonian) of a configuration system (an entire input sequence of tokens). Before expressing the total energy, we make the following clarifications:

Definition 5 (Coalition). A coalition, denoted by $C \subseteq T$, is any subset of tokens from the token set $T = \{t_1, t_2, \ldots, t_n\}$. The characteristic function $v(C)$ assigns a value to each coalition.

Definition 6 (Spin Configuration). A spin configuration, denoted by $S = (s_1, s_2, \ldots, s_n) \in \{-1, +1\}^n$, is a complete assignment of spin states to every token in the sequence, where $s_i$ represents the activation state of token $t_i$.

Definition 7 (Token System Energy). The total energy of the token system $S$ is defined as:

$H(S) = -\sum_{i=1}^n J_i s_i - \sum_{i=1}^{n-1} \sum_{j=i+1}^n J_{ij} \, s_i s_j$. (10)

The variable $s_i \in \{-1, +1\}$ represents the "activation state" or "spin" of token $t_i$. These activation states are estimated using the mean-field equations described in Section 3.10.

Definition 8 (Interpretation of Total Energy). The Hamiltonian $H(S)$ encapsulates the complete energetic state of the token configuration $S$:

• First term ($-\sum_i J_i s_i$): the external field contribution. Tokens with high importance $J_i$ (positive external field) are energetically favored to be in the $s_i = +1$ (active) state, as this lowers the total energy. Conversely, tokens with low importance $J_i$ are favored to be inactive ($s_i = -1$).

• Second term ($-\sum_{i=1}^{n-1} \sum_{j=i+1}^n J_{ij} \, s_i s_j$): the pairwise interaction contribution.
  – If $J_{ij} > 0$ (synergistic interaction), the energy is minimized when $s_i$ and $s_j$ have the same sign ($+1,+1$ or $-1,-1$), encouraging cooperative activation or deactivation.
  – If $J_{ij} < 0$ (antagonistic interaction), the energy is minimized when $s_i$ and $s_j$ have opposite signs, encouraging competitive behavior where one token activates while the other suppresses.
  – If $J_{ij} = 0$, tokens $i$ and $j$ are independent.

The system naturally seeks low-energy configurations, which correspond to coherent token activations that maximize overall semantic coherence.

3.8 Attention Weights

We now define the attention weight assigned to each token within a token system configuration:

Definition 9 (Coalitional Attention Weight). The attention weight $\alpha_i$ of token $t_i$ is defined as the marginal probability that token $t_i$ is in the active state ($s_i = +1$) under the Gibbs distribution over all possible configurations $S$:

$\alpha_i = P(s_i = +1) = \sum_{S : s_i = +1} P(S)$, (11)

where $P(S)$ is the Gibbs distribution:

$P(S) = \frac{\exp(-H(S)/\gamma)}{\sum_{S'} \exp(-H(S')/\gamma)}$. (12)

The attention weights satisfy the following properties:

• Boundedness: each attention weight is a probability and thus lies in the unit interval: $0 \le \alpha_i \le 1$ for all $i \in \{1, \ldots, n\}$. (13)

• Relation between $\alpha_i$ and $\langle s_i \rangle$: let $p = P(s_i = +1)$. Then the expected value of $s_i$ is $\langle s_i \rangle = (+1)p + (-1)(1-p) = 2p - 1$, so $p = (1 + \langle s_i \rangle)/2$. Since $\alpha_i = p$, we have $\alpha_i = (1 + \langle s_i \rangle)/2$.

• Sum as expected number of active tokens: the sum of attention weights equals the expected number of active tokens in the system:

$\sum_{i=1}^n \alpha_i = E\left[ \sum_{j=1}^n \mathbb{1}_{\{s_j = +1\}} \right] = E[k(S)]$, (14)

where $k(S) = \sum_{j=1}^n \mathbb{1}_{\{s_j = +1\}}$ denotes the number of active tokens in configuration $S$. Note that this sum is not necessarily equal to 1, unlike in standard attention mechanisms.

Remark 3.
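Definition 9 can be sanity-checked by brute force for a tiny system: enumerate all $2^n$ spin configurations, weight them by the Gibbs distribution, and read off the marginals (Eqs. 10-12). The helper name and toy parameters below are assumptions for illustration only:

```python
import itertools
import numpy as np

def gibbs_attention_exact(J, Jij, gamma=1.0):
    """Exact attention weights alpha_i = P(s_i = +1) under the Gibbs distribution
    of the Ising Hamiltonian (Eqs. 10-12). Enumerates all 2^n spin
    configurations, so this is feasible only for tiny n."""
    n = len(J)
    weights, configs = [], []
    for S in itertools.product([-1, +1], repeat=n):
        S = np.array(S)
        # Jij is symmetric with zero diagonal, so 0.5 * S^T Jij S = sum_{i<j} Jij s_i s_j
        H = -J @ S - 0.5 * S @ Jij @ S
        weights.append(np.exp(-H / gamma))
        configs.append(S)
    weights = np.array(weights) / sum(weights)  # divide by partition function Z
    # marginal probability that each spin is +1
    return np.array([sum(w for w, S in zip(weights, configs) if S[i] == +1)
                     for i in range(n)])

J = np.array([1.0, -1.0, 0.2])            # external fields (token importance)
Jij = np.array([[0.0, 0.5, 0.0],          # symmetric couplings, zero diagonal
                [0.5, 0.0, -0.3],
                [0.0, -0.3, 0.0]])
alpha = gibbs_attention_exact(J, Jij)
```

Note that `alpha.sum()` is the expected number of active tokens, not 1, consistent with Eq. 14.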
Unlike conventional attention mechanisms, where $\sum_i \alpha_i = 1$ by construction, our framework imposes no such constraint. The quantity $\sum_i \alpha_i$ emerges naturally as the expected number of active tokens, reflecting the semantic complexity of the input sequence. Sequences requiring more active tokens to represent their meaning will have larger total attention.

3.9 NeuroGame Transformer: Physical Interpretation

The NeuroGame Transformer establishes a direct analogy with statistical physics, summarized by the following correspondences:

• Tokens: spins in a magnetic system.
• Token importance $J_i$: the external magnetic field.
• Interaction potential $J_{ij}$: the exchange coupling between spins.
• Temperature: $\gamma$ is the system temperature.
• Attention weight $\alpha_i$: a local magnetization.
• Temperature effects: as $\gamma \to 0$, the distribution concentrates on the lowest-energy configuration, yielding deterministic attention where $\alpha_i \in \{0, 1\}$ indicates whether token $i$ is active in the ground state. As $\gamma \to \infty$, all configurations become equally probable, producing uniform attention with $\alpha_i = 1/2$ for all tokens $i$. In the intermediate regime, the system balances energy minimization with thermal exploration, resulting in probabilistic attention that smoothly interpolates between these extremes.

3.10 Mean-Field Approximation

The exact computation of the spin expectations $\langle s_i \rangle$ requires solving the full statistical physics system defined by the Hamiltonian in Equation 10. Theoretically, these expectations are given by:

$\langle s_i \rangle = \frac{1}{Z} \sum_S s_i \exp(-H(S)/\gamma)$, with $Z = \sum_S \exp(-H(S)/\gamma)$, (15)

where the sums run over all $2^n$ possible spin configurations $S \in \{-1, +1\}^n$. This exact computation is combinatorially intractable for any realistic sequence length, as it requires summing over an exponentially growing number of configurations.
3.10.1 Mean-Field Self-Consistency Equations

To overcome this intractability, we employ the mean-field approximation, a standard technique from statistical physics that replaces fluctuating interactions with their average values. The following theorem provides a tractable solution for the $n$ spin values associated with the $n$ tokens:

Theorem 1 (Mean-Field Self-Consistency). For a token system $S = (s_1, s_2, \ldots, s_n)$ governed by the Hamiltonian $H(S)$ with temperature parameter $\gamma > 0$, the expected spins $\langle s_i \rangle = E[s_i]$ under the Boltzmann distribution $P(S) \propto \exp(-H(S)/\gamma)$ satisfy the self-consistent system of equations:

$\langle s_i \rangle = \tanh\left( \frac{1}{\gamma} \left( J_i + \sum_{j \neq i} J_{ij} \langle s_j \rangle \right) \right)$, $i = 1, \ldots, n$. (16)

Proof. The proof proceeds in five steps.

Step 1: Exact expression for $\langle s_i \rangle$. By definition, the expected value of $s_i$ under the Boltzmann distribution is:

$\langle s_i \rangle = \frac{1}{Z} \sum_S s_i \exp\left( -\frac{H(S)}{\gamma} \right)$, (17)

where $Z = \sum_S \exp(-H(S)/\gamma)$ is the partition function and the sum runs over all $2^n$ configurations $S = (s_1, \ldots, s_n) \in \{-1, +1\}^n$.

Step 2: Mean-field approximation. The mean-field approximation replaces the fluctuating spins $s_j$ ($j \neq i$) in the interaction term with their average values $\langle s_j \rangle$. This yields an effective Hamiltonian for token $i$:

$H_i^{eff}(s_i) = -\left( J_i + \sum_{j \neq i} J_{ij} \langle s_j \rangle \right) s_i$. (18)

The term $\sum_{j \neq i} J_{ij} \langle s_j \rangle$ represents the mean field created by all other tokens acting on token $i$.

Step 3: Effective field. Define the effective field acting on token $i$ as:

$h_i^{eff} = J_i + \sum_{j \neq i} J_{ij} \langle s_j \rangle$. (19)

Under the mean-field approximation, token $i$ behaves as an independent spin in this effective field, with Hamiltonian $H_i^{eff}(s_i) = -h_i^{eff} s_i$.

Step 4: Single-spin distribution. For a single spin in an external field $h_i^{eff}$, the Boltzmann distribution is:

$P(s_i) = \frac{e^{h_i^{eff} s_i / \gamma}}{e^{h_i^{eff}/\gamma} + e^{-h_i^{eff}/\gamma}}$. (20)

Step 5: Computing the expectation. The expected value of $s_i$ under this distribution is:

$\langle s_i \rangle = (+1) \cdot P(s_i = +1) + (-1) \cdot P(s_i = -1)$ (21)
$= \frac{e^{h_i^{eff}/\gamma} - e^{-h_i^{eff}/\gamma}}{e^{h_i^{eff}/\gamma} + e^{-h_i^{eff}/\gamma}}$ (22)
$= \tanh\left( \frac{h_i^{eff}}{\gamma} \right)$. (23)

Substituting the definition of $h_i^{eff}$ yields:

$\langle s_i \rangle = \tanh\left( \frac{1}{\gamma} \left( J_i + \sum_{j \neq i} J_{ij} \langle s_j \rangle \right) \right)$. (24)

This equation must hold simultaneously for all $i = 1, \ldots, n$, forming a system of $n$ coupled nonlinear equations whose solution gives the self-consistent mean-field approximation of the true expected spins. ∎

Remark 4. Equation (16) is a fixed-point system. It can be solved iteratively by starting from an initial guess $\langle s_i \rangle^{(0)}$ and updating:

$\langle s_i \rangle^{(t+1)} = \tanh\left( \frac{1}{\gamma} \left( J_i + \sum_{j \neq i} J_{ij} \langle s_j \rangle^{(t)} \right) \right)$, (25)

until convergence. This iterative scheme typically converges within 10-20 fixed-point iterations in practice. The equation has a clear physical interpretation: the expected spin of token $i$ is determined by the effective field $h_i^{eff} = J_i + \sum_{j \neq i} J_{ij} \langle s_j \rangle$, which combines its own importance $J_i$ with the mean influence of all other tokens.

3.10.2 Computational Tractability

The mean-field formulation reduces an exponential problem to a polynomial one. Table I compares the complexity of exact summation with the mean-field iterative technique.

TABLE I: Complexity comparison between exact and mean-field approaches.
Method | Complexity | Feasibility for n = 512
Exact summation | $O(2^n)$ | impossible ($2^{512} \approx 10^{154}$ operations)
Mean-field iteration | $O(n^2 \cdot T)$ | tractable ($\sim 10^6$ operations with $T = 20$ iterations)
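The fixed-point update of Remark 4 (Eq. 25) can be sketched in a few lines, vectorized with a zero-diagonal coupling matrix so the $j \neq i$ exclusion is automatic; the function name, stopping criterion, and toy parameters are illustrative assumptions:

```python
import numpy as np

def mean_field_attention(J, Jij, gamma=1.0, max_iter=20, tol=1e-6):
    """Fixed-point iteration for the mean-field equations (Eqs. 16/25) and the
    resulting attention weights alpha_i = (1 + <s_i>) / 2.
    J: (n,) external fields; Jij: (n, n) symmetric couplings with zero diagonal."""
    s = np.zeros_like(J)  # initial guess <s_i>^(0) = 0
    for _ in range(max_iter):
        s_new = np.tanh((J + Jij @ s) / gamma)  # zero diagonal excludes j = i
        if np.max(np.abs(s_new - s)) < tol:
            s = s_new
            break
        s = s_new
    return (1.0 + s) / 2.0  # marginal probabilities used as attention weights

J = np.array([1.0, -1.0, 0.2])
Jij = np.array([[0.0, 0.5, 0.0],
                [0.5, 0.0, -0.3],
                [0.0, -0.3, 0.0]])
alpha = mean_field_attention(J, Jij)
```

This approximates the exact Gibbs marginals at $O(n^2 \cdot T)$ cost, consistent with Table I, instead of the $O(2^n)$ exact summation.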
Equation 16 is solved via fixed-point iteration, which converges rapidly (typically within $T = 10$-$20$ iterations) and scales quadratically with sequence length $n$, making it feasible even for transformers with thousands of tokens.

3.11 Single- and Multi-Head Attention Output

We present in this section the attention output that invokes the attention weights $\alpha_i$, for a single head as well as for multiple heads.

3.11.1 Single-Head Attention Output

For a given input sequence of token representations $\{x_i\}_{i=1}^n$, the attention output is computed as the weighted sum of value vectors: $z = \sum_{i=1}^n \alpha_i \, v_i$, where $v_i = W_v x_i$ is the value projection of token $t_i$, and $\alpha_i = P(s_i = +1)$ are the attention weights derived from the marginal probabilities of the Gibbs distribution over token configurations. This formulation produces a contextualized representation $z$ that aggregates information from all tokens in proportion to their game-theoretic importance, as captured by the external field $J_i$ and interaction potentials $J_{ij}$. Unlike standard transformer attention, the weights $\alpha_i$ are not heuristic scores but emerge from the statistical physics of the token system, grounded in coalitional game theory.

3.11.2 Multi-Head Attention Output

Following the transformer architecture, we extend NeuroGame Attention to multiple heads operating in parallel.
For each head $h = 1, \ldots, H$, we maintain independent parameters:

• External fields $J_i^{(h)}$ derived from head-specific Shapley and Banzhaf values with learnable mixing coefficients $\lambda_i^{(h)} = \sigma(w_h^T x_i + b_h)$
• Interaction potentials $J_{ij}^{(h)}$ capturing head-specific pairwise token relationships
• A temperature parameter $\gamma_h$ controlling the sharpness of the head's Gibbs distribution
• A value projection matrix $W_v^{(h)}$ projecting tokens into the head's value space

Each head independently solves its mean-field equations to obtain head-specific expected spins $\langle s_i \rangle^{(h)}$ and attention weights $\alpha_i^{(h)} = (1 + \langle s_i \rangle^{(h)})/2$. The output of head $h$ is then:

$z^{(h)} = \sum_{i=1}^n \alpha_i^{(h)} \, W_v^{(h)} x_i$, with $z^{(h)} \in \mathbb{R}^{d_v}$. (26)

The final multi-head output $z_{cat}$ is obtained by concatenating all head outputs and projecting back to the model dimension:

$z_{cat} = \mathrm{Concat}(z^{(1)}, z^{(2)}, \ldots, z^{(H)}) \, W_O$, (27)

where $W_O \in \mathbb{R}^{H d_v \times d_{model}}$ is the output projection matrix. This formulation enables each head to capture different interaction patterns and importance regimes, with the interpretability of $\alpha_i^{(h)}$ preserved at the head level.

3.12 Feedforward Network Classification

The concatenated multi-head output $z_{cat}$ is passed through a feedforward neural network with $m$ layers. Let $h^{(0)} = z_{cat}$. For layer $\ell = 1, \ldots, m$:

$h^{(\ell)} = \mathrm{Activation}(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)})$, (28)

where $W^{(\ell)}$ and $b^{(\ell)}$ are learnable parameters. Common choices for the activation are ReLU or GELU. The final layer output $h^{(m)}$ is mapped to logits for the three classes:

$o = W^{(m+1)} h^{(m)} + b^{(m+1)} \in \mathbb{R}^3$. (29)

The predicted class $\omega^*$ is obtained via softmax as:

$\omega^* = \arg\max_{\omega_i} P(\omega_i \mid \text{input}) = \arg\max_{\omega_i} \left[ \frac{\exp(o_{\omega_i})}{\sum_{j=1}^c \exp(o_{\omega_j})} \right]$, (30)

where $c$ is the number of classes.
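The head-wise aggregation and output projection (Eqs. 26-27) can be sketched as follows. The per-head attention weights here are random stand-ins (in the model they come from each head's mean-field solution), and all names and shapes are illustrative assumptions:

```python
import numpy as np

def neurogame_multihead(X, alphas, W_vs, W_O):
    """Multi-head NeuroGame attention output (Eqs. 26-27):
    z^(h) = sum_i alpha_i^(h) W_v^(h) x_i, then concatenate and project by W_O.
    X: (n, d); alphas: (H, n) per-head weights; W_vs: (H, d_v, d); W_O: (H*d_v, d_model)."""
    heads = [alpha @ X @ W_v.T          # (d_v,) weighted sum of value projections
             for alpha, W_v in zip(alphas, W_vs)]
    return np.concatenate(heads) @ W_O  # (d_model,)

rng = np.random.default_rng(1)
n, d, d_v, H, d_model = 5, 8, 4, 2, 8
X = rng.normal(size=(n, d))
alphas = rng.uniform(size=(H, n))       # stand-ins for Gibbs marginals alpha_i^(h)
W_vs = rng.normal(size=(H, d_v, d))
W_O = rng.normal(size=(H * d_v, d_model))
z = neurogame_multihead(X, alphas, W_vs, W_O)
```

Because the weights need not sum to 1, no softmax normalization appears here, in contrast to standard multi-head attention.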
4 Two-Stage Approximation Framework

Our NeuroGame Transformer employs two distinct approximation techniques to overcome combinatorial intractability at different levels of the framework. Understanding the distinction between these techniques is crucial for appreciating the computational feasibility of the model.

4.1 The Two Levels of Combinatorial Explosion

The framework faces two separate combinatorial challenges:

1. Token-level (game theory): computing the game-theoretic quantities requires summing over all coalitions $C \subseteq T \setminus \{t_i\}$ for the external fields $J_i$, and $C \subseteq T \setminus \{t_i, t_j\}$ for the interaction potentials $J_{ij}$, of which there are $2^{n-1}$ and $2^{n-2}$ respectively.

2. Spin-level (statistical physics): computing the expected spins $\langle s_i \rangle$ requires summing over all spin configurations $S \in \{-1, +1\}^n$, of which there are $2^n$.

Each challenge demands a different approximation strategy, as the two operate on different mathematical objects and serve different purposes.

4.2 Approximation Methods

We address the combinatorial complexity of the game-theoretic computations via importance-weighted Monte Carlo sampling, and handle the statistical physics component with mean-field equations. This section focuses on the sampling methods used to estimate the Shapley values, Banzhaf indices, and pairwise interaction potentials that parameterize the Hamiltonian.

4.2.1 Importance-Weighted Monte Carlo Estimation

Exact computation of the Shapley value $\phi_i$ and Banzhaf index $\beta_i$ requires summing over all permutations or subsets, which is intractable for large $n$. To enable efficient estimation, we define a target distribution over coalitions that emphasizes semantically valuable subsets. Let the characteristic function $v(C)$ measure the semantic value of coalition $C$. The target distribution is Gibbsian:

$P_{target}(C) \propto \exp\left( \frac{v(C)}{\gamma} \right)$, (31)

where $\gamma$ is a temperature parameter.
We sample coalitions from a uniform proposal distribution $p(C)$ (e.g., $p(C)=1/2^{n-1}$ for subsets). The unnormalized importance weight for a sampled coalition $C_k$ is $\exp(v(C_k)/\gamma)/p(C_k)$. After normalization over $K$ samples, we obtain:

$$\bar{w}(C_k)=\frac{\exp\!\big(v(C_k)/\gamma\big)/p(C_k)}{\sum_{\ell=1}^{K}\exp\!\big(v(C_\ell)/\gamma\big)/p(C_\ell)}, \quad (32)$$

where the intractable partition function cancels. These normalized weights are used to form asymptotically unbiased estimates of the game-theoretic quantities. We now derive the estimators of the Shapley value, the Banzhaf power index, and the pairwise interaction:

• Shapley value: We estimate $\phi_i$ via permutation-based sampling. For each sample $k$, draw a random permutation of the tokens in $T\setminus\{t_i\}$ and let $P_i^{(k)}$ be the set of tokens preceding $t_i$. The estimator is:

$$\hat{\phi}_i=\sum_{k=1}^{K}\bar{w}\big(P_i^{(k)}\big)\Big[v\big(P_i^{(k)}\cup\{t_i\}\big)-v\big(P_i^{(k)}\big)\Big], \quad (33)$$

where $\bar{w}(P_i^{(k)})$ is computed using the uniform proposal probability of that set under permutation sampling. For a set $P_i^{(k)}(\pi)$ (the tokens that precede $t_i$ in the permutation $\pi$) of size $|P_i^{(k)}(\pi)|$:

$$p\big(P_i^{(k)}(\pi)\big)=\frac{|P_i^{(k)}(\pi)|!\,\big(n-1-|P_i^{(k)}(\pi)|\big)!}{(n-1)!}, \quad (34)$$

where $n$ is the total number of tokens.

• Banzhaf index: We sample coalitions $C_k\subseteq T\setminus\{t_i\}$ by including each token independently with probability $1/2$. The estimator is:

$$\hat{\beta}_i=\sum_{k=1}^{K}\bar{w}(C_k)\big[v(C_k\cup\{t_i\})-v(C_k)\big], \quad (35)$$

with $p(C_k)=1/2^{n-1}$.

• Pairwise interaction: For each token pair $(t_i,t_j)$, we sample coalitions $C_k\subseteq T\setminus\{t_i,t_j\}$ uniformly (each token included with probability $1/2$) and estimate

$$\hat{J}_{ij}=\sum_{k=1}^{K}\bar{w}(C_k)\,\Delta_{ij}(C_k), \quad (36)$$

with $\bar{w}(C_k)$ computed as above using $p(C_k)=1/2^{n-2}$.
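The self-normalized estimators of Eqs. (33)-(35) can be sketched compactly. The function name and plain-Python structure below are illustrative assumptions, not the authors' implementation; the Shapley branch weights each sampled predecessor set by $\exp(v(P)/\gamma)/p(P)$ with $p(P)$ from Eq. (34), while for the Banzhaf branch the constant proposal $p(C)=1/2^{n-1}$ cancels after normalization.

```python
import math
import random

def shapley_banzhaf_mc(tokens, v, i, K=500, gamma=1.0, seed=0):
    """Self-normalized importance-weighted MC estimates of the Shapley
    value (Eqs. 33-34) and Banzhaf index (Eq. 35) for token i.
    v maps a frozenset coalition to its semantic value."""
    rng = random.Random(seed)
    n = len(tokens)
    others = [t for t in tokens if t != i]

    # Shapley: predecessors of t_i under a random permutation
    w_sh, m_sh = [], []
    for _ in range(K):
        perm = others[:]
        rng.shuffle(perm)
        r = rng.randrange(n)                      # position of t_i, 0..n-1
        P = frozenset(perm[:r])
        p = math.factorial(r) * math.factorial(n - 1 - r) / math.factorial(n - 1)
        w_sh.append(math.exp(v(P) / gamma) / p)   # Eq. (34) proposal
        m_sh.append(v(P | {i}) - v(P))            # marginal contribution

    # Banzhaf: Bernoulli(1/2) coalitions; constant proposal cancels
    w_bz, m_bz = [], []
    for _ in range(K):
        C = frozenset(t for t in others if rng.random() < 0.5)
        w_bz.append(math.exp(v(C) / gamma))
        m_bz.append(v(C | {i}) - v(C))

    wavg = lambda w, m: sum(a * b for a, b in zip(w, m)) / sum(w)
    return wavg(w_sh, m_sh), wavg(w_bz, m_bz)
```

Because both estimators are weighted averages of marginal contributions, each estimate is guaranteed to lie between the smallest and largest sampled marginal, which makes them numerically stable regardless of $n$.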
These importance-weighted Monte Carlo estimates handle the combinatorial explosion in coalition space, are asymptotically unbiased, and remain numerically stable for large $n$ thanks to the cancellation of the partition function in the normalized weights.

4.2.2 Mean-Field Equations and Attention Weights

The mean-field equations solve for the expected spins $\langle s_i\rangle$ via an iterative process over a system of $n$ coupled nonlinear equations:

$$\langle s_i\rangle=\tanh\!\left(\frac{1}{\gamma}\Big(J_i+\sum_{j\neq i}\hat{J}_{ij}\langle s_j\rangle\Big)\right),\qquad i=1,\dots,n, \quad (37)$$

avoiding the intractable sum over $2^n$ spin configurations. This fixed-point iteration typically converges within 10-20 steps. The attention weights are computed as the marginal probabilities of each token being active, $\alpha_i=(1+\langle s_i\rangle)/2$. These weights are then used in the standard attention output formula $z=\sum_i\alpha_i v_i$, where the $v_i$ are value projections of the tokens.

4.3 Asymptotic Behavior and Complexity

With a sufficiently large number of Monte Carlo samples $K$, the estimators $\hat{\phi}_i$ (Shapley value), $\hat{\beta}_i$ (Banzhaf index), and $\hat{J}_{ij}$ (interaction potential) are consistent estimators of their respective true values under importance-weighted sampling with Gibbs-distributed weights. Under mild regularity assumptions on the characteristic function $v(\cdot)$ and the proposal distribution, these estimators converge almost surely to their true values as $K\to\infty$. The use of normalized weights ensures that the intractable partition function of the Gibbs distribution cancels naturally, while avoiding explicit $2^n$ factors that would cause numerical instability for large $n$. Gibbs-distributed importance sampling concentrates samples on high-value coalitions that contribute most to marginal gains, improving estimator efficiency. The self-normalized nature of the weights yields asymptotically unbiased estimates with variance that decreases at the standard Monte Carlo rate of $O(1/K)$.
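The fixed-point iteration of Eq. (37) takes only a few lines. The sketch below uses plain simultaneous (Jacobi) updates, omitting the damping factor reported later in Sec. 7.3, and returns both the expected spins and the attention weights $\alpha_i=(1+\langle s_i\rangle)/2$; the function name and list-based interface are illustrative.

```python
import math

def mean_field_spins(J, J_pair, gamma=1.0, T=20, tol=1e-6):
    """Jacobi fixed-point iteration of Eq. (37):
    <s_i> = tanh((J_i + sum_{j != i} J_ij <s_j>) / gamma).
    J: length-n external fields; J_pair: n x n symmetric couplings.
    Returns expected spins and attention weights a_i = (1 + <s_i>) / 2."""
    n = len(J)
    s = [0.0] * n
    for _ in range(T):
        s_new = [math.tanh((J[i] + sum(J_pair[i][j] * s[j]
                                       for j in range(n) if j != i)) / gamma)
                 for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(s_new, s)) < tol
        s = s_new
        if converged:
            break
    alpha = [(1 + si) / 2 for si in s]
    return s, alpha
```

Each sweep costs $O(n^2)$, so $T$ sweeps give the $O(n^2 T)$ term in the overall complexity; a positive field pushes a spin (and hence its attention weight) above $1/2$, a negative field below.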
More importantly, the computational complexity per token scales as $O(K)$, independent of the total number of possible coalitions ($2^{n-1}$ for Shapley and Banzhaf, $2^{n-2}$ for interaction potentials). This represents an exponential reduction compared to exact enumeration and renders Monte Carlo estimation feasible for practical transformer attention layers, even for long sequences where $n$ is large (e.g., $n=512$ or more). The trade-off between computational cost and estimation accuracy is controlled by $K$, with typical values ranging from $K=100$ to $K=1000$ providing sufficient accuracy in practice.

Following the game-theory phase, the mean-field equations for the spin system are solved via a fixed-point iteration requiring $O(n^2\cdot T)$ operations, where $T$ is the number of iterations (typically 10-20). This two-stage pipeline (Monte Carlo estimation followed by mean-field fixed-point iteration) achieves overall complexity $O(K\cdot n+n^2\cdot T)$, which is polynomial in $n$ and linear in $K$, in contrast to the exponential $O(2^n)$ cost of exact computation for either stage alone.

5 Example of Potentials and Spin States Estimation

We illustrate the complete estimation pipeline with a minimal example involving three tokens $t_1$, $t_2$, and $t_3$, representing the words "not", "good", and "movie" in the phrase "not good movie".

5.1 Game-Theoretic Quantities via Monte Carlo

Assume a characteristic function $v(C)$ that returns higher values for semantically coherent coalitions. For this example, we define:

$$v(\{t_1\})=0.2,\quad v(\{t_2\})=0.5,\quad v(\{t_3\})=0.4,$$
$$v(\{t_1,t_2\})=1.2,\quad v(\{t_1,t_3\})=0.8,\quad v(\{t_2,t_3\})=1.0,$$
$$v(\{t_1,t_2,t_3\})=1.8,\quad v(\emptyset)=0.$$

Using $K=3$ Monte Carlo samples with Gibbs-distributed importance weights at temperature $\gamma=1.0$, we sample coalitions and compute normalized weights.
Suppose we sample:

$$C_1=\{t_2\},\quad v(C_1)=0.5,\quad \exp(v(C_1)/\gamma)=e^{0.5}=1.65,$$
$$C_2=\{t_1,t_3\},\quad v(C_2)=0.8,\quad \exp(v(C_2)/\gamma)=e^{0.8}=2.23,$$
$$C_3=\{t_2,t_3\},\quad v(C_3)=1.0,\quad \exp(v(C_3)/\gamma)=e^{1.0}=2.72.$$

The normalizing constant is $Z=1.65+2.23+2.72=6.60$, yielding weights:

$$\bar{w}(C_1)=1.65/6.60=0.25,\quad \bar{w}(C_2)=2.23/6.60=0.34,\quad \bar{w}(C_3)=2.72/6.60=0.41.$$

5.1.1 Shapley Value and Banzhaf Estimation

• Shapley value: For token $t_2$, we consider permutations of the remaining tokens $\{t_1,t_3\}$. From the sampled coalitions, we compute marginal contributions:

For $C_1=\{t_2\}$: $\Delta_2(C_1)=v(\{t_2\})-v(\emptyset)=0.5$;
for $C_2=\{t_1,t_3\}$: $\Delta_2(C_2)=v(\{t_1,t_2,t_3\})-v(\{t_1,t_3\})=1.8-0.8=1.0$;
for $C_3=\{t_2,t_3\}$: $\Delta_2(C_3)=v(\{t_2,t_3\})-v(\{t_3\})=1.0-0.4=0.6$.

The Shapley estimate is the weighted sum:

$$\hat{\phi}_2=0.25\times 0.5+0.34\times 1.0+0.41\times 0.6=0.125+0.34+0.246=0.711.$$

• Banzhaf index estimation (Bernoulli sampling): For the same token $t_2$, using the same coalitions and weights, the Banzhaf estimate is identical in form but carries a different theoretical interpretation: $\hat{\beta}_2=0.711$ as well (the difference lies in the target distribution, not the weighted sum).

5.1.2 External Magnetic Fields and Pairwise Interaction

• External magnetic fields: With $\lambda_2=0.6$ (learned from token $t_2$'s context), the external field for token $t_2$ is:

$$J_2=\lambda_2\hat{\phi}_2^{\mathrm{norm}}+(1-\lambda_2)\hat{\beta}_2^{\mathrm{norm}}=0.6\times 0.711+0.4\times 0.711=0.711.$$

Similarly, suppose we obtain $J_1=0.423$, $J_3=0.512$, and interaction potentials $\hat{J}_{12}=0.466$, $\hat{J}_{13}=0.312$, $\hat{J}_{23}=0.278$.
• Pairwise interaction: For the pair $(t_1,t_2)$, we compute marginal interactions:

$$\Delta_{12}(C_1)=v(\{t_1,t_2\})-v(\{t_1\})-v(\{t_2\})+v(\emptyset)=1.2-0.2-0.5+0=0.5,$$
$$\Delta_{12}(C_2)=v(\{t_1,t_2,t_3\})-v(\{t_1,t_3\})-v(\{t_2,t_3\})+v(\{t_3\})=1.8-0.8-1.0+0.4=0.4,$$
$$\Delta_{12}(C_3)=v(\{t_1,t_2\})-v(\{t_1\})-v(\{t_2\})+v(\emptyset)=1.2-0.2-0.5+0=0.5.$$

The estimated interaction potential is

$$\hat{J}_{12}=0.25\times 0.5+0.34\times 0.4+0.41\times 0.5=0.125+0.136+0.205=0.466.$$

5.2 Mean-Field Spin Estimation

With temperature $\gamma=1.0$, we solve the mean-field equations via fixed-point iteration starting from $\langle s_i\rangle^{(0)}=0$:

$$\langle s_1\rangle^{(1)}=\tanh\big(J_1+J_{12}\langle s_2\rangle^{(0)}+J_{13}\langle s_3\rangle^{(0)}\big)=0.400,$$
$$\langle s_2\rangle^{(1)}=\tanh\big(J_2+J_{12}\langle s_1\rangle^{(0)}+J_{23}\langle s_3\rangle^{(0)}\big)=0.611,$$
$$\langle s_3\rangle^{(1)}=\tanh\big(J_3+J_{13}\langle s_1\rangle^{(0)}+J_{23}\langle s_2\rangle^{(0)}\big)=0.471.$$

Iteration 2 uses the updated values:

$$\langle s_1\rangle^{(2)}=\tanh(0.423+0.466\times 0.611+0.312\times 0.471)=0.693,$$
$$\langle s_2\rangle^{(2)}=\tanh(0.711+0.466\times 0.400+0.278\times 0.471)=0.773,$$
$$\langle s_3\rangle^{(2)}=\tanh(0.512+0.312\times 0.400+0.278\times 0.611)=0.668.$$

After $T=5$ iterations, the values converge to $\langle s_1\rangle=0.721$, $\langle s_2\rangle=0.798$, $\langle s_3\rangle=0.703$. Finally, using the attention-weight equation $\alpha_i=(1+\langle s_i\rangle)/2$, we obtain:

$$\alpha_1=\frac{1+0.721}{2}=0.861,\quad \alpha_2=\frac{1+0.798}{2}=0.899,\quad \alpha_3=\frac{1+0.703}{2}=0.852.$$
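The worked example can be checked numerically. The sketch below hard-codes the sampled coalition values and field estimates from the text and reproduces the normalized weights, the Shapley estimate $\hat{\phi}_2$, the interaction $\hat{J}_{12}$, and the first mean-field update; small deviations come from the two-decimal rounding used in the text.

```python
import math

# Reproducing Sec. 5's three-token example ("not good movie").
# The sampled values and the field/coupling numbers come from the text.
vals = [0.5, 0.8, 1.0]                        # v(C1), v(C2), v(C3)
w = [math.exp(x) for x in vals]               # gamma = 1.0, so exp(v/gamma)
Z = sum(w)                                    # normalizing constant, ~6.60
w = [x / Z for x in w]                        # ~ (0.25, 0.34, 0.41)

# Shapley estimate for t2 from the three marginal contributions
marg = [0.5, 1.0, 0.6]
phi2 = sum(wk * mk for wk, mk in zip(w, marg))   # ~ 0.711

# Interaction potential J12 from the three marginal interactions
d12 = [0.5, 0.4, 0.5]
J12 = sum(wk * dk for wk, dk in zip(w, d12))     # ~ 0.466

# First mean-field update from <s_i>^(0) = 0 (Eq. 37, gamma = 1)
J = [0.423, 0.711, 0.512]
s1 = [math.tanh(j) for j in J]                # ~ (0.400, 0.611, 0.472)
alpha = [(1 + s) / 2 for s in s1]             # attention weights after step 1
```

Running the full fixed-point iteration from these values follows the same update rule as Eq. (37), with the couplings entering from the second iteration onward.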
These weights reflect the semantic importance of each token in the phrase, with "good" ($t_2$) receiving the highest attention, followed by "not" ($t_1$) and "movie" ($t_3$), consistent with the phrase's sentiment structure, where the negation-modifier relationship is captured through the interaction potentials.

6 Comparison of NGT and Vanilla

Table I summarizes the key architectural and conceptual differences between the standard Transformer and the NeuroGame Transformer. We now provide a brief interpretation of each distinguishing feature.

TABLE I: Comparison of Standard Transformer and NeuroGame Transformer (NGT) Features

Feature | Standard Transformer (Vanilla) | NeuroGame Transformer (NGT)
Core Mechanism | Scaled dot-product attention: measures similarity/correlation between Q and K. | Game-theoretic importance: external fields $J_i$ combine Shapley ($\phi_i$) and Banzhaf ($\beta_i$).
Sampling Method | Deterministic (processes all tokens at once). | Importance-weighted Monte Carlo: uniform proposal with Gibbs-distributed weights $\bar{w}_k$.
Gating System | None (uses fixed attention heads). | $\lambda_i$ gating: $\lambda_i=\sigma(w^\top x_i+b)$ balances Shapley and Banzhaf contributions.
Token Interactions | None beyond pairwise dot-products. | Pairwise potentials $J_{ij}$ capture synergistic/antagonistic relationships.
Physical Foundation | None (purely algorithmic). | Ising Hamiltonian $H(S)=-\sum_{i=1}^{n}J_is_i-\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}J_{ij}s_is_j$.
Attention Derivation | Softmax of similarity scores. | $\alpha_i=(1+\langle s_i\rangle)/2$ from mean-field spin expectations $\langle s_i\rangle$.
Interpretability | Heuristic: shows "what" was attended. | Axiomatic: weights represent provably fair semantic contribution.
NLI Logic Handling | Pattern matching; struggles with negations. | Interaction-based: catches "swing voters" via Banzhaf and $J_{ij}$.
Complexity | $O(n^2)$ matrix multiplication. | $O(K\cdot n+n^2\cdot T)$ with $K$ Monte Carlo samples, $T$ mean-field iterations.
• Core Mechanism: Standard attention measures similarity between token pairs via dot-products. NGT measures contribution via game-theoretic values: Shapley ($\phi_i$) captures fair global contribution, while Banzhaf ($\beta_i$) identifies decisive local influence. This shifts attention from "how related are these tokens?" to "how much does this token contribute to the whole?"

• Sampling Method: Standard attention processes all tokens deterministically. NGT uses importance-weighted Monte Carlo with Gibbs-distributed weights $\bar{w}_k=e^{v(C_k)/\gamma}/\sum_{\ell}e^{v(C_\ell)/\gamma}$, making exponential coalition spaces tractable while focusing on semantically dense coalitions.

• Gating System: The learnable parameter $\lambda_i=\sigma(w^\top x_i+b)$ dynamically balances Shapley (global fairness) and Banzhaf (local decisiveness) based on token context, a key architectural innovation.

• Token Interactions: Standard attention captures only pairwise similarities. NGT introduces explicit interaction potentials $J_{ij}$, estimated via Monte Carlo, capturing whether token pairs are synergistic ($J_{ij}>0$), antagonistic ($J_{ij}<0$), or independent ($J_{ij}\approx 0$).

• Physical Foundation: Standard attention is purely algorithmic. NGT is grounded in an Ising Hamiltonian $H(S)=-\sum_{i=1}^{n}J_is_i-\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}J_{ij}s_is_j$, where each token is a spin $s_i\in\{-1,+1\}$, bringing statistical physics principles to attention.

• Attention Derivation: Standard attention uses a softmax of similarity scores. NGT first solves the mean-field equations $\langle s_i\rangle=\tanh\big(\frac{1}{\gamma}(J_i+\sum_{j\neq i}J_{ij}\langle s_j\rangle)\big)$ for expected spins, then derives attention as $\alpha_i=(1+\langle s_i\rangle)/2$, the probability that a token is active in thermodynamic equilibrium.

• Interpretability: Standard attention weights are heuristic post-hoc explanations.
NGT weights are axiomatic, derived from game-theoretic solution concepts that satisfy formal fairness axioms; they represent a provably "fair share" of semantic contribution.

• NLI Logic Handling: Standard attention relies on pattern matching, struggling with negations. NGT's Banzhaf component explicitly identifies "swing voters" (words that flip logical relationships), while $J_{ij}$ captures how such words interact, enabling robust handling of contradiction and negation.

• Complexity: Standard attention is $O(n^2)$. NGT is $O(K\cdot n+n^2\cdot T)$ with $K$ Monte Carlo samples (100-1000) and $T$ mean-field iterations (10-20), feasible for long sequences while avoiding exponential costs.

Finally, Algorithm 1 depicts the NeuroGame Transformer (NGT) attention:

Algorithm 1 NeuroGame Transformer Attention

Input: $X\in\mathbb{R}^{d\times n}$ (token embeddings); $W_v\in\mathbb{R}^{d\times d}$ (value projection); $w\in\mathbb{R}^{d}$, $b\in\mathbb{R}$ (gating parameters); $\gamma>0$ (temperature); $K$ (MC samples); $T$ (MF iterations); $\epsilon$ (tolerance)
Output: $z\in\mathbb{R}^{d}$, $\alpha\in\mathbb{R}^{n}$, $J_{\mathrm{field}}\in\mathbb{R}^{n}$, $J_{\mathrm{inter}}\in\mathbb{R}^{n\times n}$

1: $V\leftarrow W_vX$  // projected values
2: for $i=1$ to $n$ do  // token-specific gating
3:   $\lambda_i\leftarrow\sigma(w^\top X[:,i]+b)$
4: end for
5: $\phi\leftarrow 0_n$; $\beta\leftarrow 0_n$; $J_{\mathrm{inter}}\leftarrow 0_{n\times n}$  // initialize estimators
6: for $i=1$ to $n$ do  // Shapley values (permutation sampling)
7:   $W,\mathcal{M}\leftarrow\emptyset$
8:   for $k=1$ to $K$ do
9:     random permutation $\pi$ of $\{1,\dots,n\}\setminus\{i\}$; $r\sim\mathrm{Uniform}\{0,\dots,n-1\}$
10:     $P\leftarrow$ first $r$ tokens of $\pi$; $s\leftarrow|P|$; $p\leftarrow s!\,(n-1-s)!/(n-1)!$
11:     $v_p\leftarrow\|\sum_{j\in P}V[:,j]\|_2$; $\Delta\leftarrow\|\sum_{j\in P\cup\{i\}}V[:,j]\|_2-v_p$
12:     $w_{\mathrm{raw}}\leftarrow\exp(v_p/\gamma)/p$; append $w_{\mathrm{raw}}$ to $W$ and $\Delta$ to $\mathcal{M}$
13:   end for
14:   normalize $W$ to $\{\bar{w}_k\}$; $\phi[i]\leftarrow\sum_{k=1}^{K}\bar{w}_k\mathcal{M}_k$
15: end for
16: for $i=1$ to $n$ do  // Banzhaf indices (Bernoulli sampling)
17:   $W,\mathcal{M}\leftarrow\emptyset$
18:   for $k=1$ to $K$ do
19:     $C\subseteq\{1,\dots,n\}\setminus\{i\}$ via Bernoulli$(1/2)$; $p\leftarrow 1/2^{n-1}$
20:     $v_c\leftarrow\|\sum_{j\in C}V[:,j]\|_2$; $\Delta\leftarrow\|\sum_{j\in C\cup\{i\}}V[:,j]\|_2-v_c$
21:     $w_{\mathrm{raw}}\leftarrow\exp(v_c/\gamma)/p$; append $w_{\mathrm{raw}}$ to $W$ and $\Delta$ to $\mathcal{M}$
22:   end for
23:   normalize $W$; $\beta[i]\leftarrow\sum_{k=1}^{K}\bar{w}_k\mathcal{M}_k$
24: end for
25: for $i=1$ to $n$ do  // pairwise interactions
26:   for $j=i+1$ to $n$ do
27:     $W,D\leftarrow\emptyset$
28:     for $k=1$ to $K$ do
29:       $C\subseteq\{1,\dots,n\}\setminus\{i,j\}$ via Bernoulli$(1/2)$; $p\leftarrow 1/2^{n-2}$
30:       $v_c\leftarrow\|\sum_{m\in C}V[:,m]\|_2$; $v_i\leftarrow\|\sum_{m\in C\cup\{i\}}V[:,m]\|_2$
31:       $v_j\leftarrow\|\sum_{m\in C\cup\{j\}}V[:,m]\|_2$; $v_{ij}\leftarrow\|\sum_{m\in C\cup\{i,j\}}V[:,m]\|_2$
32:       $\Delta\leftarrow v_{ij}-v_i-v_j+v_c$
33:       $w_{\mathrm{raw}}\leftarrow\exp(v_c/\gamma)/p$; append $w_{\mathrm{raw}}$ to $W$ and $\Delta$ to $D$
34:     end for
35:     normalize $W$; $J_{\mathrm{inter}}[i,j]\leftarrow\sum_{k=1}^{K}\bar{w}_kD_k$; $J_{\mathrm{inter}}[j,i]\leftarrow J_{\mathrm{inter}}[i,j]$
36:   end for
37: end for
38: $\phi^{\mathrm{norm}}\leftarrow\phi/\|\phi\|_1$; $\beta^{\mathrm{norm}}\leftarrow\beta/\|\beta\|_1$  // normalize and combine
39: for $i=1$ to $n$ do
40:   $J_i\leftarrow\lambda_i\phi_i^{\mathrm{norm}}+(1-\lambda_i)\beta_i^{\mathrm{norm}}$
41: end for
42: $J_{\mathrm{field}}\leftarrow[J_1,\dots,J_n]$
43: $s\leftarrow 0_n$  // mean-field iteration
44: for $t=1$ to $T$ do
45:   $s_{\mathrm{old}}\leftarrow s$
46:   for $i=1$ to $n$ do
47:     $h_i^{\mathrm{eff}}\leftarrow J_i+\sum_{j\neq i}J_{\mathrm{inter}}[i,j]\,s_{\mathrm{old}}[j]$; $s[i]\leftarrow\tanh(h_i^{\mathrm{eff}}/\gamma)$
48:   end for
49:   if $\max_i|s[i]-s_{\mathrm{old}}[i]|<\epsilon$ then break
50: end for
51: for $i=1$ to $n$ do  // output
52:   $\alpha_i\leftarrow(1+s[i])/2$
53: end for
54: $\alpha\leftarrow[\alpha_1,\dots,\alpha_n]$; $z\leftarrow\sum_{i=1}^{n}\alpha_iV[:,i]$
55: return $z$, $\alpha$, $J_{\mathrm{field}}$, $J_{\mathrm{inter}}$

7 Experiments

7.1 Problem Definition and Datasets

The problem we address consists in testing the ability of the NeuroGame Transformer to determine logical relationships (entailment, contradiction, neutrality) between sentences, which requires deep semantic understanding and contradiction detection. The NeuroGame Transformer therefore receives as input a pair of sentences, a premise and a hypothesis, and must predict the logical relationship between them.
In other words, the NeuroGame Transformer solves a classification problem over a predefined set of classes: entailment, neutral, and contradiction. We evaluate the performance of the NeuroGame Transformer on the Natural Language Inference (NLI) problem. This assessment is undertaken on the SNLI [6] and MNLI-matched [32] datasets, the latter testing robustness to spurious patterns. Table I describes each of the two datasets:

TABLE I: Comparison of SNLI and MNLI-matched

Aspect | SNLI | MNLI-matched
Purpose | Large-scale NLI training [6] | Multi-genre NLI [32]
Size | 570k pairs | 433k pairs
Labels | Entailment, contradiction, neutral | Entailment, contradiction, neutral
Source | Image captions | 10 diverse genres
Evaluation | In-distribution | Matched/mismatched
Challenge | General understanding | Cross-genre

We present examples from each dataset used to address the Natural Language Inference problem. Table IV illustrates examples from the SNLI and MNLI-matched datasets:

TABLE IV: Concrete examples across datasets

Dataset | Premise | Hypothesis | Label
SNLI | "A man is being interviewed by a reporter." | "A reporter is interviewing a man." | entailment
SNLI | "The doctor was paid by the actor." | "The doctor paid the actor." | contradiction
MNLI-matched | "The new rights are supposedly why we're here." (letters genre) | "We are here because of new rights." | entailment
MNLI-matched | "The cat hid under the bed during the storm." (fiction genre) | "The cat was outside in the rain." | contradiction

The two datasets serve complementary roles. SNLI provides large-scale training examples from image captions, offering a foundation for learning natural linguistic phenomena [6]. MNLI extends this with premises from ten diverse genres, including fiction, speech, and government reports, testing cross-genre generalization [32]. Both require genuine inference, e.g., understanding that "A man is being interviewed" entails "A reporter is interviewing a man".
Together, these datasets provide a comprehensive framework: SNLI assesses basic NLI capability, and MNLI-matched tests domain generalization.

7.2 Baseline Models

We evaluate our NeuroGame Transformer (NGT) against several strong baselines on the SNLI and MNLI-matched benchmarks as described in [13]. The selected models represent key architectural developments in natural language inference, ranging from dedicated sequence-based encoders to large pretrained transformers. Table V reports their test accuracies as cited in the literature.

• DAM (Decomposable Attention Model) [26] is a simple yet effective model that averages word embeddings and passes the result through one or more feedforward layers. Despite its simplicity, it provides a competitive baseline for sentence representation tasks.

• ESIM (Enhanced Sequential Inference Model) [7] is a dedicated NLI architecture that uses bidirectional LSTMs with local inference modeling and inference composition. It achieved state-of-the-art performance at the time of its release.

• BERT-Base [10] (Bidirectional Encoder Representations from Transformers) is pretrained on large corpora using masked language modeling and next-sentence prediction. Its deep architecture (12 layers, 768 hidden size) has become a standard baseline for natural language understanding tasks. We report the fine-tuned performance on SNLI and MNLI.

• BERT-Large [10] scales the architecture to 24 layers and 1024 hidden units, providing higher capacity at the cost of increased computation. Its performance on both datasets demonstrates the benefit of larger pretrained models.

• RoBERTa-Base [21] (Robustly Optimized BERT Approach) improves upon BERT with dynamic masking, larger mini-batches, and removal of the next-sentence-prediction objective, leading to stronger representations across a range of benchmarks. We include both base and large variants.
• RoBERTa-Large [21] further increases model size and training data, achieving state-of-the-art results on many NLP tasks. Its performance on MNLI is particularly strong.

• ALBERT-Base [19] (A Lite BERT) reduces memory footprint through factorized embedding parameterization and cross-layer parameter sharing, while maintaining performance through a self-supervised loss for sentence-order prediction. The base version offers a good trade-off between efficiency and accuracy.

• ALBERT-Large [19] scales the architecture similarly to BERT-Large but with parameter sharing, achieving competitive results with fewer parameters.

All baseline results are taken from the comprehensive evaluation reported in [13]. For fair comparison, we use the same preprocessing pipeline and evaluation protocol across all models. NGT is compared against these baselines under identical conditions.

7.3 NGT Implementation Details

We now detail the NeuroGame Transformer implementation used for our experiments.

• Model Architecture: NGT employs a BERT-base encoder to produce contextualized token embeddings. On top of these, we incorporate two intertwined sets of components: game-theoretic estimators and a statistical physics layer.

• Game-theoretic components: Shapley values and Banzhaf indices are estimated via importance-weighted Monte Carlo sampling with a Gibbs target and uniform proposal. We use $K_{\mathrm{mc}}=15$ coalitions during training and $K_{\mathrm{mc}}=25$ during evaluation. These estimates define the local fields $J_i$ and pairwise interactions $J_{ij}$ that characterize the coalitional game through the characteristic function $v$.

• Statistical physics components: A spin projection layer maps the set of token embeddings to the set of Ising spins $s_i\in[-1,1]$. The mean-field fixed-point equations are solved with temperature $\gamma=0.25$, damping factor 0.7, $T_{\mathrm{mf}}=25$ iterations, and tolerance $10^{-4}$.
The resulting equilibrium spins $\langle s_i\rangle$ are passed to the classifier, while the game-theoretic quantities provide interpretable attributions.

• Training Details: We train for five epochs on SNLI and MNLI-matched, using a learning rate of $3\times 10^{-5}$, gradient accumulation over 2 steps (effective batch size 32), and weight decay 0.02. The AdamW optimizer [22] is used with linear warmup over 10% of steps, followed by a MultiStepLR scheduler reducing the learning rate by a factor of 0.1 after epochs 3 and 4. Gradient clipping (norm 1.0) is applied.

• Regularization: We apply dropout ($p=0.15$), label smoothing (0.1), mixup ($\alpha_{\mathrm{mix}}=0.2$), and exponential moving average (EMA) with decay 0.999. All sequences are truncated or padded to 128 tokens using the BERT tokenizer.

7.4 Main Results and Interpretation

We evaluate the NeuroGame Transformer (NGT) on the SNLI and MNLI-matched benchmarks, comparing its performance against several established deep learning models. Table V summarizes the test accuracies along with the approximate number of parameters for each model.

On the SNLI dataset, NGT achieves a test accuracy of 86.60% with approximately 110 million parameters, adding only 754,000 parameters (0.7% overhead) to the standard BERT-Base architecture. This result is competitive with strong baselines such as BERT-Base (88.86%, 110M), ALBERT-Base (86.49%, 11M), and ESIM (87.00%, 4.3M). Notably, NGT outperforms earlier sequence-based models like DAM (83.30%, 382K) and matches the performance of more complex architectures while using a fraction of the parameters of larger models (e.g., BERT-Large uses 340M, RoBERTa-Large uses 355M). This demonstrates that integrating game-theoretic attributions with a mean-field Ising model yields high representational efficiency with minimal computational overhead.

TABLE V: Test accuracy (%) on SNLI and MNLI datasets. All accuracy values are taken from [13].
Parameter counts are standard values from the respective model papers.

Model | SNLI | MNLI-Matched | Parameters (approx.)
DAM | 83.30 | 68.41 | 382K
ESIM | 87.00 | 76.63 | 4.3M
BERT-Base | 88.86 | 80.99 | 110M
BERT-Large | 90.33 | 84.83 | 340M
RoBERTa-Base | 86.95 | 85.98 | 125M
RoBERTa-Large | 91.83 | 90.20 | 355M
ALBERT-Base | 86.49 | 79.89 | 11M
ALBERT-Large | 90.85 | 86.76 | 17M
NGT (Our Model) | 86.60 | 79.00 | 110M

On the more challenging MNLI-matched dataset, NGT reaches 79.00% accuracy, which is close to ALBERT-Base (79.89%, 11M) and within 2 percentage points of BERT-Base (80.99%, 110M). Given that MNLI covers ten diverse text genres, this result highlights the robustness of NGT's game-theoretic components across varied linguistic domains. The gap between NGT and larger models like RoBERTa-Base (85.98%, 125M) and RoBERTa-Large (90.20%, 355M) is expected, as those models benefit from more extensive pretraining and larger architectures. However, NGT achieves these results with only 0.7% additional parameters beyond BERT-Base, demonstrating that its enhancements are parameter-efficient.

Remarkably, NGT achieves these results with only a BERT-Base backbone (i.e., only at initialization), demonstrating that its game-theoretic enhancements (Shapley values, Banzhaf indices, pairwise interaction potentials, and mean-field spin equilibria) boost representational power without requiring larger architectures or additional pretraining. The pairwise interaction matrix $J_{ij}$ is particularly revealing: unlike standard attention mechanisms that only capture pairwise similarity, $J_{ij}$ explicitly models whether token pairs cooperate (positive values) or compete (negative values). For example, in the phrase "not good movie," the interaction between "not" and "good" is strongly negative, reflecting the antagonistic relationship that flips sentiment. In contrast, a phrase like "very good movie" would show positive interactions between "very" and "good," indicating synergy.
Standard attention mechanisms aggregate these relationships into a single scalar weight, losing the nuanced distinction between cooperation and competition. Similarly, the Banzhaf index helps identify "swing voters" (tokens that critically influence logical relationships, such as negation words), which standard attention often downweights. The Shapley values provide token-level attributions that are both globally consistent and locally faithful, enabling fine-grained analysis of model decisions.

As with most NLI models, the "Neutral" class remains challenging, but NGT's enhanced capacity improves its validation accuracy substantially. The hyperparameter configuration (128 spins, 25 mean-field iterations, $K_{\mathrm{mc}}=15/25$) balances capacity and stability, with regularization techniques (label smoothing, mixup, EMA) aiding generalization.

Together, these components position NGT as a compelling interpretable alternative to standard transformers. It delivers competitive accuracy with minimal overhead (only 0.7% additional parameters), while providing transparency into its decision-making process, a critical step toward advancing trustworthy AI. The explicit modeling of pairwise interactions and the ability to distinguish cooperative from competitive token relationships are features that no other model in the comparison offers, making NGT uniquely suited for tasks requiring deep linguistic understanding.

Furthermore, the improved performance on the challenging "Neutral" class stems from three game-theoretic components in NGT. First, the Banzhaf index identifies "swing voters" and decreases the influence of spurious words that might otherwise lead to misclassification. Second, the pairwise interaction matrix $J_{ij}$ captures the absence of strong token relationships in Neutral examples, encoding them as near-zero values, a signal standard attention cannot provide.
Third, Shapley values ensure fair credit distribution across tokens, preventing over-attribution to salient but non-decisive words. Together, these mechanisms enable NGT to recognize not only when a logical relationship exists but, critically, when it does not.

8 Conclusion and Future Work

We introduced the NeuroGame Transformer (NGT), a novel attention mechanism that reformulates token interactions as a coalitional game grounded in statistical physics. By combining Shapley values (global fairness) with Banzhaf power indices (local sensitivity), NGT provides a principled framework for semantic attribution. The mean-field Ising model, parameterized by local fields $J_i$ and pairwise interactions $J_{ij}$, yields equilibrium spin states $\langle s_i\rangle$ that serve as interpretable token representations. Importance-weighted Monte Carlo sampling with a Gibbs target and uniform proposal ensures tractable estimation of these game-theoretic quantities even for long sequences.

Experimental results demonstrate the effectiveness of our approach and validate that the game-theoretic formulation successfully addresses the "right for the wrong reasons" pitfall common in standard attention mechanisms. Beyond performance gains, NGT offers enhanced interpretability through its game-theoretic importance scores. The local fields $J_i$ provide token-level attributions grounded in cooperative game theory, while the interaction matrix $J_{ij}$ reveals synergistic relationships between tokens that standard attention weights cannot capture.

The NeuroGame Transformer represents a foundational paradigm shift rather than a final architecture; its formalism opens numerous avenues for further refinement, including more efficient Monte Carlo estimators, adaptive temperature scheduling, and integration with larger pretrained models. Future research directions include:

(i) Multi-modal extensions: applying NeuroGame attention to vision-language transformers and cross-modal reasoning [25].
(ii) Bayesian exploration: developing fully Bayesian sequence models with probabilistic dependencies to improve predictive performance [3].

References

[1] H. Alers-Valentín, S. Fong, and J. F. Vega-Riveros (2023) Modeling syntactic knowledge with neuro-symbolic computation. In Proceedings of the 15th International Conference on Agents and Artificial Intelligence (ICAART), Vol. 3, pp. 608-616.
[2] J. F. Banzhaf III (1965) Weighted voting doesn't work: a mathematical analysis. Rutgers Law Review 19, p. 317.
[3] D. Bouchaffra, E. Koontz, V. Kripasundar, and R. K. Srihari (1996) Incorporating diverse information sources in handwriting recognition postprocessing. International Journal of Imaging Systems and Technology 7 (4), pp. 320-329.
[4] D. Bouchaffra, F. Ykhlef, B. Faye, M. Lebbah, and H. Azzag (2026) Game theory meets statistical physics: a novel deep neural networks design. IEEE Transactions on Cybernetics.
[5] D. Bouchaffra, F. Ykhlef, B. Faye, M. Lebbah, and H. Azzag (2025) Redesigning deep neural networks: bridging game theory and statistical physics. Neural Networks 191, p. 107807.
[6] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632-642.
[7] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1657-1668.
[8] K. Choromanski et al. (2021) Rethinking attention with performers. In International Conference on Learning Representations (ICLR).
[9] M. Dehghani et al. (2023) Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pp. 7480-7512.
[10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[11] F. Draye, A. Lei, H. Pan, I. Posner, and B. Schölkopf (2025) Sparse attention post-training for mechanistic interpretability. arXiv preprint arXiv:2512.05865.
[12] Y. Du, J. Meier, J. Ma, R. Fergus, and A. Rives (2020) Energy-based models for atomic-resolution protein conformations. In International Conference on Learning Representations (ICLR).
[13] P. Eleftheriadis, I. Perikos, and I. Hatzilygeroudis (2023) Evaluating deep learning techniques for natural language inference. Applied Sciences 13 (2), p. 785.
[14] B. Feng, Z. Lai, S. Li, Z. Wang, S. Wang, P. Huang, and M. Cao (2025) Breaking down video LLM benchmarks: knowledge, spatial perception, or true temporal understanding? In Evaluating the Evolving LLM Lifecycle Workshop, NeurIPS 2025.
[15] A. Ghorbani and J. Zou (2019) Data Shapley: equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242-2251.
[16] A. Gu and T. Dao (2023) Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
[17] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321-1330.
[18] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret (2020) Transformers are RNNs: fast autoregressive transformers with linear attention. In International Conference on Machine Learning, Vol. 119, pp. 5156-5165.
[19] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
[20] W. Liu, X. Wang, J. Owens, and Y. Li (2021) Energy-based out-of-distribution detection. In Advances in Neural Information Processing Systems, Vol. 34, pp. 21464-21475.
[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
[22] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA.
[23] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, Vol. 30, pp. 4768-4777.
[24] S. Luo, S. Li, T. Cai, D. He, D. Peng, S. Zheng, G. Ke, L. Wang, and T. Liu (2021) Stable, fast and accurate: kernelized attention with relative positional encoding. In Advances in Neural Information Processing Systems (NeurIPS).
[25] F. Lygerakis, O. Özdenizci, and E. Rückert (2025) ViTaPEs: visuotactile position encodings for cross-modal alignment in multimodal transformers. arXiv preprint arXiv:2505.20032.
[26] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2249-2255.
External Links: Document, Link, 1606.01933 Cited by: 1st item. [27] L. S. Shapley (1953) A value for n-person games. In Contributions to the Theory of Games, Vol. 2, p. 307–317. Cited by: §2.2. [28] L. Shu, J. Tang, X. Guan, and D. Zhang (2026-01) A comprehensive survey of genome language models in bioinformatics. Briefings in Bioinformatics 27 (1), p. p. bbaf724. Cited by: §1. [29] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning, p. 3319–3328. Cited by: §2.2. [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems, p. 6000–6010. Cited by: §2.1. [31] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.3. [32] A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, p. 1112–1122. Cited by: §7.1, §7.1, TABLE I.