Paper deep dive
AS2 -- Attention-Based Soft Answer Sets: An End-to-End Differentiable Neuro-Soft-Symbolic Reasoning Architecture
Wael AbdAlmageed
Abstract
Neuro-symbolic artificial intelligence (AI) systems typically couple a neural perception module to a discrete symbolic solver through a non-differentiable boundary, preventing constraint-satisfaction feedback from reaching the perception encoder during training. We introduce AS2 (Attention-Based Soft Answer Sets), a fully differentiable neuro-symbolic architecture that replaces the discrete solver with a soft, continuous approximation of the Answer Set Programming (ASP) immediate consequence operator $T_P$. AS2 maintains per-position probability distributions over a finite symbol domain throughout the forward pass and trains end-to-end by minimizing the fixed-point residual of a probabilistic lift of $T_P$, thereby differentiating through the constraint check without invoking an external solver at either training or inference time. The architecture is entirely free of conventional positional embeddings. Instead, it encodes problem structure through constraint-group membership embeddings that directly reflect the declarative ASP specification, making the model agnostic to arbitrary position indexing. On Visual Sudoku, AS2 achieves 99.89% cell accuracy and 100% constraint satisfaction (verified by Clingo) across 1,000 test boards, using a greedy constrained decoding procedure that requires no external solver. On MNIST Addition with $N \in \{2, 4, 8\}$ addends, AS2 achieves digit accuracy above 99.7% across all scales. These results demonstrate that a soft differentiable fixpoint operator, combined with constraint-aware attention and declarative constraint specification, can match or exceed pipeline and solver-based neuro-symbolic systems while maintaining full end-to-end differentiability.
Links
- Source: https://arxiv.org/abs/2603.18436v1
- Canonical: https://arxiv.org/abs/2603.18436v1
Full Text
AS2 -- Attention-Based Soft Answer Sets: An End-to-End Differentiable Neuro-Soft-Symbolic Reasoning Architecture
Wael AbdAlmageed, Clemson University, wabdalm@clemson.edu
1 Introduction
The integration of neural perception and symbolic reasoning has received considerable attention in recent years (Garcez et al., 2019; Manhaeve et al., 2018; Li et al., 2023). However, neuro-symbolic pipelines often lack end-to-end differentiability because of the discrete boundary between their perception and reasoning components, particularly when logical constraints are explicitly specified. Perceptual constraint-satisfaction problems (P-CSPs) (Wang et al., 2019; Yang et al., 2020; Mulamba et al., 2024) require mapping raw sensor data (e.g., images of handwritten digits) to discrete symbolic assignments that collectively satisfy a set of declarative constraints (e.g., Sudoku rules or arithmetic identities). Neural models excel at extracting high-level representations from raw, high-dimensional inputs, while symbolic reasoners excel at enforcing global consistency through declarative constraints. The challenge of combining these two capabilities arises from the incompatibility of their computational registers, since neural inference operates over continuous distributions while symbolic reasoning requires discrete variable assignments that satisfy hard logical constraints. The dominant paradigm in neuro-symbolic artificial intelligence (AI) addresses P-CSPs through a pipeline architecture that decouples perception from reasoning, in which a neural network first processes inputs and a discrete symbolic solver is then invoked to find a globally consistent assignment. Answer Set Programming (ASP) (Brewka et al., 2011; Gelfond and Lifschitz, 1988) is the predominant symbolic formalism employed in such pipelines. ASP is a mature declarative logic-programming paradigm whose stable-model semantics support default negation, choice rules, and optimization statements, enabling compact specification of complex combinatorial constraints.
Clingo (Gebser et al., 2014) is the most widely used ASP solver, while CP-SAT solvers (Mulamba et al., 2024) provide a propositional alternative when the constraint vocabulary does not require full ASP expressiveness. This decomposition of neuro-symbolic pipelines introduces a fundamental structural bottleneck, since the symbolic (i.e., discrete) handoff between the neural and symbolic components prevents (potentially useful) constraint-satisfaction gradients from flowing back into the perception modules. The perception module is therefore trained without any feedback about how its predictions affect downstream reasoning or how constraint/logical violations could improve learned representations. As a result, a single confident perception error (e.g., a misclassification) can render the symbolic grounding unsatisfiable. State-of-the-art systems, such as NeurASP (Yang et al., 2020), DeepProbLog (Manhaeve et al., 2018), Scallop (Li et al., 2023), and PBCS (Mulamba et al., 2024), all exhibit this limitation to varying degrees. Furthermore, the disjoint nature of these pipelines prevents the reasoning component from taking advantage of the rich uncertainty encoded in the learned representations of the perception module, which could be used to guide search and inference in the symbolic solver. Alternative approaches attempt end-to-end differentiability by replacing the discrete solver with a differentiable relaxation. For example, SATNet (Wang et al., 2019) embeds a semi-definite programming (SDP)-relaxed MaxSAT layer as a neural module, and RRN (Palm et al., 2018) learns iterative message passing without any explicit symbolic component. However, these systems either lack formal constraint guarantees (e.g., RRN) or rely on relaxations that are not sound with respect to the original constraint semantics (e.g., SATNet), and they do not generalize to the expressive ASP constraint vocabulary described above.
We introduce AS2 (Attention-Based Soft Answer Sets), a fully differentiable neuro-soft-symbolic architecture that addresses these limitations through three core design principles. First, rather than committing to discrete symbol assignments before reasoning, the model maintains continuous probability distributions over a finite symbolic vocabulary throughout the entire forward pass. Constraints are enforced through a differentiable loss derived from the fixed-point residual of a probabilistic lift of the ASP immediate consequence operator $T_P$ (van Emden and Kowalski, 1976; Takemura and Inoue, 2024), enabling gradient flow from constraint satisfaction directly into the perception encoder. No external solver is invoked at either training or inference time. Second, AS2 is entirely free of conventional positional embeddings. In contrast to recent work on neuro-symbolic Transformers that relies on learned or sinusoidal positional encodings to index tokens (McLeish et al., 2024), AS2 encodes problem structure exclusively through constraint-group membership embeddings that reflect the declarative ASP specification. Each token's position is defined by the constraint groups in which it participates (e.g., rows, columns, and boxes for Sudoku), making the representation invariant to arbitrary permutations of the position index and directly grounded in the logical structure of the problem. Third, AS2 approximates a soft differentiable version of the fixpoint operator $T_P$ of logic programming. The classical $T_P$ operator maps interpretations to their immediate consequences under a logic program and converges to the unique minimal model through iterated application (van Emden and Kowalski, 1976). AS2 lifts this operator to probability distributions, replacing set-theoretic intersection with element-wise products and computing constraint violations as the squared distance between the current distribution and its image under $T_P$.
The calculated loss has no degenerate minima at the uniform distribution (unlike naive penalty formulations) and is zero if and only if the predicted distribution is a valid one-hot assignment satisfying all constraints. This probabilistic fixpoint operator serves as a fully differentiable surrogate for classical ASP inference, enabling the model to jointly learn both perception and reasoning. It is critical to note that the full differentiability of AS2 does not reduce the architecture to an unconstrained learned model. The reasoning module is not an additional free-parameter layer that approximates logic through data. Rather, it is a structural translation of the ASP immediate-consequence operator $T_P$ into continuous arithmetic, whose topology is fixed entirely by the declarative logic program and whose structure is not adjusted during training. The distinction between AS2 and a generic neural network, therefore, is not whether or not gradients are used, but what structure governs the reasoning computation, since in AS2 that structure is determined by the symbolic program and gradients flow through a reasoning pathway whose form is dictated by constraint semantics rather than learned from examples. We evaluate AS2 on two standard neuro-symbolic benchmarks. On Visual Sudoku (Wang et al., 2019; Yang et al., 2020), our primary benchmark, AS2 achieves 99.89% cell accuracy with the Transformer alone, and a greedy constrained decoding procedure (which requires no external solver) raises the constraint satisfaction rate to 100%, verified independently by the Clingo ASP solver. This result outperforms the state-of-the-art grid accuracy of 99.4% reported by PBCS (Mulamba et al., 2024), which relies on a CP-SAT solver at inference time. On MNIST Addition (Manhaeve et al., 2018) with $N \in \{2, 4, 8\}$ addends, AS2 achieves digit accuracy above 99.7% across all scales, confirming that the architecture generalizes beyond Latin-square constraints to arithmetic reasoning.
The contributions of this paper are:
• A fully differentiable neuro-soft-symbolic architecture that replaces discrete ASP solvers with a soft probabilistic $T_P$ operator, enabling end-to-end training with constraint feedback.
• Constraint-group membership embeddings as a positional-embedding-free mechanism for encoding problem structure in attention-based models.
• Demonstrating that greedy constrained decoding, combined with the trained soft distributions, achieves perfect constraint satisfaction on Visual Sudoku without invoking any external solver.
2 Related Work
We organize related work into three major families of constraint-solving methods that involve neural perception. The first family, pipeline neuro-symbolic systems, couples a neural perception module to a discrete symbolic solver via a non-differentiable boundary. The second family, end-to-end differentiable neuro-symbolic systems, replaces the discrete solver with a differentiable relaxation. The third family, perception-based constraint solving methods, supervises a neural perception module jointly with a classical solver via calibration feedback at inference time.
2.1 Pipeline Neuro-Symbolic Systems
By far, the dominant paradigm in neuro-symbolic artificial intelligence (AI) couples a neural perception module with a classical symbolic solver via a discrete handoff/interface. NeurASP (Yang et al., 2020) wraps neural network outputs as probabilistic facts in an Answer Set Programming (ASP) program and calls Clingo at inference time. This coupling is shallow because the gradient does not flow from the solver back into the perception encoder. DeepProbLog (Manhaeve et al., 2018, 2021a) attaches neural predicates to a ProbLog engine and computes exact gradients through weighted model counting. However, it does not use stable-model semantics and does not scale beyond small programs. Scallop (Li et al., 2023) introduces differentiable Datalog with provenance semirings.
Its constraint vocabulary is limited to stratified Datalog, which lacks default negation, choice rules, and optimization statements. LTN (Badreddine et al., 2022) grounds first-order logic formulae as differentiable real-valued functions, yet the fuzzy-logic relaxation discards crisp stable-model semantics, making formal verification intractable. All pipeline systems share a structural bottleneck studied systematically by Yang et al. (2020), in which constraint-satisfaction loss cannot flow back through the symbolic boundary, so the perception encoder is trained without feedback about how its outputs affect reasoning.
2.2 End-to-End Differentiable Neuro-Symbolic Systems
SATNet (Wang et al., 2019) embeds a differentiable semi-definite programming (SDP)-relaxed MaxSAT layer as a neural module, enabling end-to-end training. However, the SDP relaxation may certify infeasible instances as satisfiable, and its $O(n^3)$ complexity prevents scaling to large constraint systems. RRN (Palm et al., 2018) learns iterative message passing on a fully-connected cell graph, solving Sudoku without any explicit symbolic component. It serves as a strong neural baseline, yet offers no formal constraint guarantees and does not generalize to new constraint families. A-NeSI (van Krieken et al., 2023) addresses the intractability of exact inference in DeepProbLog by introducing approximate neuro-symbolic inference, at some cost to the tightness of gradient estimates. SLASH (Skryagin et al., 2022) combines probabilistic circuits with neural ASP for tractable inference via sum-product networks, yet the coupling still relies on a classical solver at test time. The work most closely related to ours is Takemura and Inoue (2024), who propose differentiable constraint learning via the immediate-consequence operator for neural ASP. Their formulation provides the probabilistic lift of $T_P$ that we adopt as the foundation for our constraint loss.
However, Takemura and Inoue (2024) do not address the architectural question of how to propagate constraint information across positions without positional embeddings, do not incorporate inference-time constraint propagation, and do not evaluate on perceptual benchmarks where raw images must be jointly classified and reasoned about. AS2 builds on their $T_P$ formulation and extends it with constraint-group membership embeddings, a multi-layer Transformer reasoning module, and greedy constrained decoding, yielding a complete end-to-end system for perceptual constraint-satisfaction problems.
2.3 Perception-Based Constraint Solving
Mulamba et al. (2024) introduce a predict-then-optimize framework (which we abbreviate PBCS) that supervises a Cell-CNN digit classifier jointly with a CP-SAT solver via calibration feedback at inference time. On MNIST-based Visual Sudoku, PBCS achieves 99.4% grid accuracy by exploiting the fact that the distribution over digit labels, when properly calibrated, provides enough soft evidence for CP-SAT to back-propagate constraint violations into the perception loss via a no-good-aware surrogate. However, PBCS inherits three structural limitations of the pipeline paradigm. First, no gradient flows directly from the constraint solver through the discrete handoff into the perception encoder during training. The no-good surrogate is a post-hoc approximation of the true constraint signal. Second, the constraint vocabulary is limited to AllDifferent; extending to recursive rules, default negation, or optimization statements would require a new solver entirely. Third, PBCS applies only when a complete, sound CP model exists for the target task. There is no path to open-world or defeasible reasoning. AS2 addresses all three gaps. Gradients flow end-to-end through the differentiable $T_P$ residual loss, and the constraint vocabulary is the full syntax of Answer Set Programming. Stable-model semantics support open-world and default reasoning natively.
2.4 Positional Encodings in Neuro-Symbolic Transformers
Standard Transformer architectures (Vaswani et al., 2017) require positional encodings to break the permutation symmetry of self-attention. Recent work on applying Transformers to constraint-satisfaction problems has adopted either sinusoidal or learned positional embeddings to index cells or tokens. These encodings impose an arbitrary ordering on the problem variables, and the model must learn to disentangle positional information from constraint structure. AS2 takes a fundamentally different approach, replacing conventional positional embeddings entirely with constraint-group membership embeddings, where each token's representation is enriched by the sum of learned embeddings indexed by the constraint groups to which the token belongs (e.g., row, column, and box for Sudoku). This design directly encodes the logical structure of the problem into the token representations and makes the model invariant to any re-indexing of positions that preserves constraint-group membership. To our knowledge, AS2 is the first neuro-symbolic Transformer that achieves competitive results on Visual Sudoku without any form of conventional positional encoding.
3 Method
Figure 1: AS2 architecture. Raw perceptual signals are encoded cell-by-cell by a shared-weight convolutional neural network (CNN) (§3.4). A concept bottleneck head produces per-position pre-reasoning logits; clue cells are evidence-clamped before the soft distributions are projected to $\mathbb{R}^{d_\text{model}}$ and enriched by constraint-group membership embeddings (row, column, and box for Sudoku), replacing conventional positional encodings. A multi-layer Transformer encoder propagates information across all positions jointly. The output head produces post-reasoning logits supervised end-to-end by the differentiable $T_P$ fixed-point loss (Equation 3), which enforces constraint satisfaction without calling an external solver.
3.1 Problem Formulation
We address the class of perceptual constraint-satisfaction problems (P-CSPs), in which a model observes n raw perceptual signals $X=(X_1,\dots,X_n)$ and must assign each position i a symbol from a finite domain $\Sigma=\{s_1,\dots,s_k\}$, subject to a set of declarative integrity constraints C specified as a logic program P over ground atoms drawn from a Herbrand structure over $\Sigma$. A binary evidence mask $m \in \{0,1\}^n$ identifies positions whose assignments are directly observable ($m_i=1$, evidence) versus positions that must be inferred ($m_i=0$, latent). The ground-truth assignment is $y \in \Sigma^n$ and the task is to predict $\hat{y} \in \Sigma^n$ that (i) agrees with all evidence positions and (ii) is an answer set of P, satisfying every constraint in C. As illustrated in Fig. 1, AS2 addresses limitations of classic neuro-symbolic architectures that invoke discrete solvers at inference time by keeping representations soft and continuous throughout. Instead of committing to discrete symbols, the model maintains per-position probability distributions $p \in [0,1]^{n \times k}$ over $\Sigma$ and differentiates through the constraint check using a probabilistic lift of the immediate-consequence operator described below. The solver is replaced by a differentiable loss that trains the model to produce distributions that are simultaneously consistent with observations and consistent with C. At inference, the predicted assignment is recovered by $\hat{y}_i = \arg\max_s p_{i,s}$. No external solver is called.
3.2 Probabilistic Lift of the Immediate-Consequence Operator
Let P be a definite logic program and $T_P$ its immediate-consequence operator (van Emden and Kowalski, 1976), the function that maps a Herbrand interpretation I to the set of atoms whose bodies are satisfied under I. The least fixed point $\mathrm{lfp}(T_P)$ equals the unique minimal Herbrand model of P. A ground atom a is a logical consequence of P if and only if $a \in \mathrm{lfp}(T_P)$.
The integrity constraints in C take the form $:\!-\ \varphi$, which are violated whenever $\varphi$ holds under the predicted assignment. Each such constraint induces a constraint group $G \subseteq \{1,\dots,n\}$, a subset of positions whose assignments must be mutually exclusive with respect to some symbol-level condition. Following Takemura and Inoue (2024), we lift $T_P$ to probability distributions as follows. Given per-position distributions $p \in [0,1]^{n \times k}$ (the output of softmax layers), the probabilistic $T_P$ operator applied to group G is defined position- and symbol-wise as shown in Eq. 1:

$T_P(p)_{i,s} = p_{i,s} \cdot \prod_{j \in G,\, j \neq i} (1 - p_{j,s}), \quad i \in G,\ s \in \Sigma.$ (1)

Intuitively, position i can claim symbol s only if it already assigns probability mass to s and no competing position in G does so. The fixed-point condition $p = T_P(p)$ holds if and only if p is a valid one-hot assignment in which every group constraint is satisfied. For the MNIST Addition task, on the other hand, the constraint is global arithmetic consistency, requiring the sum of the predicted digits to equal the observed total. We implement the $T_P$ operator for arithmetic constraints using polynomial convolution. Given N addend distributions $p_1,\dots,p_N \in [0,1]^{10}$ and a target sum S, the leave-one-out sum distribution for addend i is computed via prefix-suffix convolutions, as shown in Eq. 2:

$T_P(p)_{i,d} = p_{i,d} \cdot P\Big(\sum_{j \neq i} d_j = S - d\Big),$ (2)

where $P(\sum_{j \neq i} d_j = S - d)$ is obtained by convolving the probability vectors of all addends except i. This avoids the combinatorial explosion of enumerating all digit tuples and is computed efficiently in $O(N \cdot k^2)$ using forward and backward prefix convolutions implemented as grouped 1D convolutions.
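A minimal NumPy sketch of the lifted operator in Eq. 1 makes the fixed-point property concrete. This is our own illustration, not the paper's vectorised implementation; the function name and the loop-based form are our choices.

```python
import numpy as np

def tp_group(p, group):
    """Probabilistic T_P for one exclusivity group (Eq. 1): position i
    keeps mass on symbol s only if no other position in the group claims s."""
    out = p.copy()
    for i in group:
        others = [j for j in group if j != i]
        # product over competing positions of (1 - p_{j,s})
        out[i] = p[i] * np.prod(1.0 - p[others], axis=0)
    return out

# The fixed-point condition p = T_P(p): a valid one-hot assignment is a
# fixed point, while the uniform distribution is not.
group = [0, 1, 2, 3]
one_hot = np.eye(4)                 # 4 positions, 4 symbols, all-different
uniform = np.full((4, 4), 0.25)
assert np.allclose(tp_group(one_hot, group), one_hot)
assert not np.allclose(tp_group(uniform, group), uniform)
```

At the uniform point each entry maps to $0.25 \cdot 0.75^3 \approx 0.105 \neq 0.25$, which is exactly the gap the fixed-point loss of §3.3 exploits to avoid a degenerate minimum.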
3.3 The $T_P$ Fixed-Point Loss
Naive constraint penalties for exclusivity, such as the squared-sum loss $(\sum_i p_{i,s} - 1)^2$ (Wang et al., 2019), have a degenerate global minimum at the uniform distribution $p_{i,s} = 1/k$, which satisfies no actual constraint. The AS2 constraint loss avoids this pathology by directly minimizing the fixed-point residual of $T_P$ across all constraint groups, as defined in Eq. 3:

$\mathcal{L}_{ic}(p) = \sum_{G \in C} \sum_{i \in G} \sum_{s \in \Sigma} \big(p_{i,s} - T_P(p)_{i,s}\big)^2.$ (3)

At the uniform distribution, $T_P(p)_{i,s} = (1/k) \cdot ((k-1)/k)^{k-1} \neq 1/k$ for any $k \geq 2$, so $\mathcal{L}_{ic} > 0$. Therefore, the degenerate minimizer of the squared-sum loss is not a minimizer of $\mathcal{L}_{ic}$. Conversely, at any valid one-hot assignment the product in Equation (1) collapses to $p_{i,s} \cdot 1^{|G|-1} = p_{i,s}$, giving $\mathcal{L}_{ic} = 0$. The entire computation is differentiable and fully vectorized in log-space for numerical stability (Takemura and Inoue, 2024). No solver is invoked during either training or inference.
3.4 Architecture
AS2 has three coupled modules that together implement a soft, latent-space path from raw perceptual inputs to a constraint-satisfying assignment, without ever converting representations to discrete symbols before the final decoding step.
Perception Module. The perception module maps each raw perceptual signal $X_i$ to a continuous embedding $h_i \in \mathbb{R}^{d_\text{model}}$. For Visual Sudoku, we use a shared-weight three-block CNN (32 → 64 → 128 channels with max-pooling and LayerNorm) over $28 \times 28$ cell images. For MNIST Addition, the same CNN backbone is applied independently to each addend digit image. A concept bottleneck head (Koh et al., 2020), a single linear layer, projects $h_i$ to pre-reasoning logits $z_i^{pre} \in \mathbb{R}^k$, providing an interpretable interface between perception and reasoning.
Evidence Clamping.
Before the reasoning module, evidence clamping replaces the soft distribution of observed positions with the one-hot ground-truth label (only during training) or the perception module's argmax prediction (during inference), anchoring the latent representations at known positions. The resulting k-dimensional soft distributions are projected back to $d_\text{model}$ for input to the reasoning module.
Constraint-Group Membership Embeddings (Positional-Embedding-Free Design). Rather than using conventional positional embeddings (e.g., sinusoidal) to index tokens by their sequential position, AS2 encodes problem structure through constraint-group membership embeddings. Each token's representation is enriched by the element-wise sum of learned embedding vectors indexed by the constraint groups to which the token belongs. For Visual Sudoku, three embedding tables are defined, one for each of the nine rows, nine columns, and nine $3 \times 3$ boxes, yielding a positional representation $e_i = e_i^{row} + e_i^{col} + e_i^{box}$. This design directly encodes the logical structure of the constraint program into the token representations, such that two cells share a high degree of positional similarity if and only if they participate in the same constraint groups. The model is therefore invariant to any re-indexing of positions that preserves constraint-group membership, and the embeddings do not impose any arbitrary sequential ordering on the problem variables. For MNIST Addition, a simple learned addend-index embedding $e_i^{pos}$ ($i=1,\dots,N$) is summed into each token.
Reasoning Module. The n tokens (enriched with constraint-group membership embeddings) are processed by a multi-layer Transformer encoder (Vaswani et al., 2017) with pre-norm (LayerNorm before attention and feedforward), GELU activation, and full self-attention.
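The constraint-group membership embeddings described above can be sketched concretely for Sudoku. The table sizes, random initialization, and helper names below are our own illustrative choices; in AS2 the tables would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
# one learned table per constraint family: 9 rows, 9 columns, 9 boxes
E_row = rng.normal(size=(9, d_model))
E_col = rng.normal(size=(9, d_model))
E_box = rng.normal(size=(9, d_model))

def group_ids(i):
    """Constraint-group memberships of cell i (0..80) in a 9x9 grid."""
    r, c = divmod(i, 9)
    return r, c, (r // 3) * 3 + c // 3

def cell_encoding(i):
    """e_i = e_i^row + e_i^col + e_i^box: no sequential position index."""
    r, c, b = group_ids(i)
    return E_row[r] + E_col[c] + E_box[b]
```

Two cells receive similar encodings exactly when they share constraint groups, so any re-indexing of cells that preserves row, column, and box membership leaves the representation unchanged.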
Self-attention allows each position to gather evidence from every other position simultaneously, enabling the latent representations to collectively converge toward a globally consistent assignment without a single discrete commitment. The output head projects each token to post-reasoning logits $z_i^{post} \in \mathbb{R}^k$. For Visual Sudoku, the reasoning module uses $L=6$ Transformer layers, $H=8$ attention heads, $d_\text{model}=256$, and $d_{ff}=4d_\text{model}$. For MNIST Addition ($N=2,4$), we use $L=3$ layers, $H=4$ heads, and $d_\text{model}=128$. For $N=8$, we increase to $L=5$ layers to handle the harder multi-digit task. In the MNIST Addition variant, the Transformer output tokens are mean-pooled and projected to the number of possible sum classes ($10N - N + 1$).
Hard Clue Restoration. At inference time on Visual Sudoku, a final post-processing step replaces the Transformer output at evidence (clue) cells with the perception module's pre-reasoning logits, guaranteeing that clue-cell predictions exactly match the CNN's output. This architectural choice separates the evaluation of perception quality (measured at clue cells) from reasoning quality (measured at blank cells).
3.5 Declarative Constraint Specification
Constraints in AS2 are declaratively specified in standard Answer Set Programming (ASP) syntax (Brewka et al., 2011; Gebser et al., 2014), using the same notation used for Clingo programming. The user specifies what constitutes a valid assignment. The framework automatically compiles each integrity constraint $:\!-\ \varphi$ into the corresponding set of $T_P$ residual terms in Equation (3) via the probabilistic lifting of Takemura and Inoue (2024), with no manual penalty engineering. For Visual Sudoku, the complete constraint specification is shown in Fig. 2.
The three integrity constraints induce 27 constraint groups (9 rows + 9 columns + 9 boxes), each generating 81 $T_P$ residual terms, for a total of 2,187 differentiable constraint-satisfaction checks per forward pass, all derived automatically from the declarative specification.

:- R=1..9, V=1..9, #count{ C : cell(R,C,V) } != 1. % rows
:- C=1..9, V=1..9, #count{ R : cell(R,C,V) } != 1. % columns
:- BR=0..2, BC=0..2, V=1..9, #count{ R,C : cell(R,C,V), R>=BR*3+1, R<=BR*3+3, C>=BC*3+1, C<=BC*3+3 } != 1. % boxes

Figure 2: Complete declarative constraint specification for Visual Sudoku in Answer Set Programming (ASP) syntax. Each integrity constraint enforces that every digit appears exactly once in each row, column, and $3 \times 3$ box, respectively. These three rules are compiled automatically into the differentiable $T_P$ residual loss with no manual penalty engineering.
3.6 Training Objective and Curriculum
Visual Sudoku. The training loss for the Sudoku instantiation combines four terms, as shown in Eq. 4:

$\mathcal{L} = \mathcal{L}_{cnn} + \mathcal{L}_{sup} + \lambda_{blank} \cdot \alpha(t) \cdot \mathcal{L}_{blank} + \lambda_{ic} \cdot \beta(t) \cdot \mathcal{L}_{ic},$ (4)

where $\mathcal{L}_{cnn}$ is cross-entropy on $z^{pre}$ at evidence positions (trains the perception module directly), $\mathcal{L}_{sup}$ is cross-entropy on $z^{post}$ at evidence positions (grounds the reasoning module at known anchors), $\mathcal{L}_{blank}$ is cross-entropy on $z^{post}$ at latent positions against ground-truth labels (provides direct supervision where no clue is available), and $\mathcal{L}_{ic}$ is the $T_P$ fixed-point loss (Eq. 3). Two curriculum schedules govern the strength of the latter two terms. $\alpha(t) = \max(0.1,\ 1 - 0.9\,t/T_\text{decay})$ decays latent-position cross-entropy from full weight to a $0.1\times$ residual floor over $T_\text{decay} = 100$ epochs, preventing the model from fitting symbol identities at latent positions before it has learned to propagate constraint information.
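Both curriculum schedules in Eq. 4 are simple piecewise-linear functions of the epoch t; a direct transcription with the paper's values ($T_\text{decay}=100$, $T_\text{warm}=20$, 0.1 floor) as defaults:

```python
def alpha(t, t_decay=100):
    """Latent-cell supervision weight: decays linearly from 1.0 to a 0.1
    floor, so direct label fitting fades as constraint propagation takes over."""
    return max(0.1, 1.0 - 0.9 * t / t_decay)

def beta(t, t_warm=20):
    """Constraint-loss weight: linear warmup ramp from 0 to 1."""
    return min(1.0, t / t_warm)
```

At epoch 0 the model is supervised entirely by labels (alpha = 1, beta = 0); by the end of warmup the $T_P$ loss is at full strength while label supervision keeps only its residual floor.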
$\beta(t) = \min(1,\ t/T_\text{warm})$ linearly ramps the constraint loss from zero over $T_\text{warm} = 20$ warmup epochs, stabilizing early training before the $T_P$ gradient becomes dominant. Hyperparameters are $\lambda_{blank} = 0.3$ and $\lambda_{ic} = 5.0$. The CNN backbone is frozen for the first 20 epochs, then unfrozen to allow constraint feedback to influence perception.
MNIST Addition. The training loss for MNIST Addition is given in Eq. 5:

$\mathcal{L} = \mathcal{L}_{digit}^{pre} + \lambda_{digit} \cdot \alpha'(t) \cdot \mathcal{L}_{digit}^{post} + \mathcal{L}_{sum} + \lambda_{ic} \cdot \beta'(t) \cdot \mathcal{L}_{ic},$ (5)

where $\mathcal{L}_{digit}^{pre}$ and $\mathcal{L}_{digit}^{post}$ are cross-entropy on pre- and post-reasoning digit logits, $\mathcal{L}_{sum}$ is cross-entropy on the sum prediction, and $\mathcal{L}_{ic}$ is the arithmetic $T_P$ loss (Eq. 2). Direct digit supervision decays as $\alpha'(t) = \max(0.1,\ 1 - 0.9\,t/50)$, and the constraint loss ramps as $\beta'(t) = \min(1,\ t/10)$.
3.7 Inference-Time Constraint Propagation
Although the $T_P$ constraint loss trains the model to produce distributions that approach fixed points of the immediate-consequence operator, the argmax of the post-training softmax output is not guaranteed to satisfy all constraints exactly. Two complementary inference-time strategies close the remaining gap without invoking an external solver.
Iterative $T_P$ Refinement. Before taking the argmax, the soft distributions p are refined by iterating the $T_P$ operator K times. At each step, the row, column, and box $T_P$ values are computed independently, averaged element-wise, and re-normalized to a valid probability distribution, as shown in Eq. 6:

$p^{(t+1)}_{i,s} = \frac{\frac{1}{|C_i|} \sum_{G \ni i} T_P(p^{(t)})_{i,s}}{\sum_{s'} \frac{1}{|C_i|} \sum_{G \ni i} T_P(p^{(t)})_{i,s'}},$ (6)

where $C_i$ denotes the set of constraint groups containing position i and normalization is applied per-position.
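The refinement iteration in Eq. 6 can be sketched as a short, gradient-free loop. This is our reconstruction, not the paper's implementation; it assumes every position belongs to at least one constraint group, and the two-cell example is a toy illustration.

```python
import numpy as np

def tp_group(p, group):
    """Probabilistic T_P for one exclusivity group (Eq. 1)."""
    out = p.copy()
    for i in group:
        others = [j for j in group if j != i]
        out[i] = p[i] * np.prod(1.0 - p[others], axis=0)
    return out

def refine(p, groups, K=10):
    """Eq. 6: average each position's T_P image over the groups that
    contain it, renormalise per position, and iterate K times."""
    for _ in range(K):
        acc = np.zeros_like(p)
        cnt = np.zeros(len(p))
        for g in groups:
            tp = tp_group(p, g)
            acc[g] += tp[g]
            cnt[g] += 1
        avg = acc / cnt[:, None]                  # (1/|C_i|) sum over G containing i
        p = avg / avg.sum(axis=1, keepdims=True)  # per-position renormalisation
    return p
```

On a toy two-cell exclusivity group with distributions [0.9, 0.1] and [0.2, 0.8], the loop sharpens both cells toward the consistent one-hot assignment within a few iterations, which is the mechanism behind the 95.6% → 98.7% raw satisfaction gain reported in the text.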
The iteration propagates exclusivity information from neighbouring cells without discrete commitment and requires no gradient computation. On the Visual Sudoku test set, $K = 10$ iterations raise the raw constraint satisfaction rate from 95.6% to 98.7% with no retraining.

Greedy Constrained Decoding. AS2 also employs a final discrete decoding step via a confidence-ordered greedy procedure that requires no external solver. Evidence cells (identified by the mask $m$) are locked first, each assigned the argmax of the perception module's distribution. The assigned digit is then hard-zeroed from all cells sharing a constraint group with the locked cell, and the remaining distributions are renormalized. The procedure then iterates over latent cells in decreasing order of max-probability: at each step, the most confident unassigned cell is committed to its top remaining digit, and its assignment is propagated to all peers by the same masking-and-renormalization step.

4 Experimental Evaluation

We evaluate AS2 on two standard neuro-symbolic benchmarks that span different constraint types and input modalities. All experiments for AS2 and its ablations are run with three random seeds, and we report mean ± standard deviation where available. Results for prior work are taken from the respective original publications.

4.1 Benchmarks

MNIST Addition (MNISTAdd). Introduced by Manhaeve et al. (2018) as the canonical benchmark for neuro-symbolic systems, MNISTAdd presents a model with $N$ MNIST digit images (LeCun et al., 1998) and requires it to predict both the individual digit labels and their sum. Correct inference jointly requires (i) accurate digit recognition and (ii) adherence to arithmetic constraints. We evaluate three scales: $N = 2$ (sum range $[0, 18]$, 19 classes), $N = 4$ (sum range $[0, 36]$, 37 classes), and $N = 8$ (sum range $[0, 72]$, 73 classes), each with 30,000 training and 5,000 test pairs.
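The class counts quoted for each scale follow directly from the sum range: $N$ digits, each in 0–9, give sums in $[0, 9N]$, i.e. $9N + 1$ output classes. As a one-line check (the function name is ours):

```python
def sum_class_count(n_addends):
    """Number of distinct sums of n MNIST digits (each in 0..9):
    the sum ranges over [0, 9*n], hence 9*n + 1 classes."""
    return 9 * n_addends + 1
```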
This benchmark is well-saturated in the literature, meaning most methods achieve near-perfect accuracy. We include it primarily to confirm that AS2 is competitive on a standard perception task and to study scaling behaviour as $N$ increases.

Visual Sudoku. A 9×9 Sudoku solution is rendered as a 252×252 grayscale image by placing MNIST digit images in clue cells and leaving non-clue cells blank (Wang et al., 2019; Yang et al., 2020). A model receives the full board image and must predict all 81 cell values while satisfying three families of Sudoku constraints, requiring that each row, each column, and each 3×3 box contains the digits 1–9 exactly once. We use 9,000 / 1,000 / 1,000 boards for train / validation / test with approximately 45 clue cells per board. This is our primary benchmark, as it requires genuine constraint reasoning over 27 interacting constraint groups and cannot be solved by perception alone.

Table 1: Dataset statistics. Constraint type classifies the symbolic constraint family. #Groups lists the number of constraint groups per instance.

| Benchmark | Train | Val | Test | Input size | Constraint type | #Groups |
|---|---|---|---|---|---|---|
| MNISTAdd ($N=2$) | 30,000 | N/A | 5,000 | 2×28×28 | Arithmetic | 1 |
| MNISTAdd ($N=4$) | 30,000 | N/A | 5,000 | 4×28×28 | Arithmetic | 1 |
| MNISTAdd ($N=8$) | 30,000 | N/A | 5,000 | 8×28×28 | Arithmetic | 1 |
| Visual Sudoku | 9,000 | 1,000 | 1,000 | 252×252 | Latin square | 27 |

4.2 Baselines

The following are the published neuro-symbolic systems that have reported results on at least one of our benchmarks, grouped by paradigm, plus two ablation variants of our own system.

• DeepProbLog (Manhaeve et al., 2018, 2021a) integrates neural networks into ProbLog with exact gradients via weighted model counting.
• NeurASP (Yang et al., 2020) wraps neural outputs as probabilistic facts in an Answer Set Programming (ASP) program. Clingo is called at inference.
• Scallop (Li et al., 2023) implements differentiable Datalog with provenance semirings.
• A-NeSI (van Krieken et al., 2023) provides approximate scalable inference for probabilistic neuro-symbolic programs.
• SATNet (Wang et al., 2019) embeds a differentiable SDP-relaxed MaxSAT layer.
• RRN (Palm et al., 2018) learns iterative message passing on a relational graph.
• PBCS (Mulamba et al., 2024) applies perception-based CP-SAT constraint solving (evaluated on Visual Sudoku only).

Ablations of AS2.

• Perception-Only uses the CellCNN backbone only, with no reasoning module, trained with cross-entropy loss. This tests the contribution of symbolic reasoning.
• Pipeline uses the CellCNN trained with cross-entropy loss, with symbolic constraint checking applied post hoc at inference time. This tests whether end-to-end training with constraint feedback provides an advantage over pipeline-style decoupling.

4.3 Metrics

For MNIST Addition, we report Digit-Acc (fraction of individual digits predicted correctly) and Sum-Acc (fraction of instances where the predicted sum is correct). For Visual Sudoku, we report Sym-Acc (fraction of individual cells correct), Board-Acc (fraction of boards where all 81 cells are correct), CSR (fraction of boards satisfying all row, column, and box constraints, computed algebraically), and VCSR (same, verified independently by Clingo (Gebser et al., 2014)).

4.4 Results

Table 2: Results on MNISTAdd. Digit-Acc = individual digit accuracy (%); Sum-Acc = sum accuracy (%). Results for prior work are taken from the respective publications. Bold = best in column. † indicates the result uses only sum-level supervision (no per-digit labels); Digit-Acc is not reported for these methods because per-digit predictions are implicit. DeepProbLog and A-NeSI results are taken from van Krieken et al. (2023). The NeurASP result at $N = 2$ is taken from Manhaeve et al. (2021b). Scallop results are reported by Li et al. (2023) (best variant). All AS2 and ablation results use both digit and sum supervision.
| Method | $N=2$ Digit-Acc ↑ | $N=2$ Sum-Acc ↑ | $N=4$ Digit-Acc ↑ | $N=4$ Sum-Acc ↑ | $N=8$ Digit-Acc ↑ | $N=8$ Sum-Acc ↑ |
|---|---|---|---|---|---|---|
| DeepProbLog† (Manhaeve et al., 2021a) | N/A | 97.20 ± 0.50 | N/A | N/A | N/A | N/A |
| NeurASP† (Yang et al., 2020) | N/A | 97.30 ± 0.30 | N/A | N/A | N/A | N/A |
| Scallop† (Li et al., 2023) | N/A | 98.22 | N/A | 97.00 | N/A | N/A |
| A-NeSI† (van Krieken et al., 2023) | N/A | 97.66 ± 0.21 | N/A | 95.96 ± 0.38 | N/A | N/A |
| Perception-Only | 99.69 ± 0.05 | 99.37 ± 0.10 | **99.91** | **99.64** | **99.98** | 99.86 ± 0.03 |
| Pipeline | 99.63 ± 0.01 | 99.26 ± 0.02 | 99.89 ± 0.03 | 99.57 ± 0.13 | **99.98 ± 0.00** | **99.89 ± 0.03** |
| AS2 (ours) | **99.73 ± 0.03** | **99.45 ± 0.08** | 99.87 | 99.44 | 99.95 ± 0.01 | 99.01 ± 0.18 |

MNIST Addition. Table 2 presents results on MNISTAdd at three scales. At $N = 2$, AS2 achieves 99.73 ± 0.03% digit accuracy and 99.45 ± 0.08% sum accuracy. Prior neuro-symbolic systems (DeepProbLog, NeurASP, Scallop, A-NeSI) use only sum-level supervision (no per-digit labels), yet AS2 with both digit and sum supervision achieves competitive or superior results. As $N$ increases to 4 and 8, the benchmark becomes saturated for all three model variants (all above 99.8% digit accuracy). At $N = 8$, the Perception-Only and Pipeline baselines achieve slightly higher sum accuracy than AS2 (99.86% and 99.89% vs. 99.01 ± 0.18%). This gap is explained by the architectural overhead of the $T_P$ constraint loss at $N = 8$: the arithmetic $T_P$ operator requires leave-one-out convolutions over 8 addends, and the constraint loss gradient introduces optimization pressure that competes with direct digit and sum supervision. Nevertheless, the digit accuracy of AS2 remains above 99.7% at all scales, confirming that the soft constraint mechanism does not degrade perception quality. MNIST Addition is therefore insufficiently discriminative to differentiate between methods at these scales; we include it for completeness and comparability with prior work.

Table 3: Results on Visual Sudoku (our primary benchmark). Bold = best in column.
§ Perception-Only and Pipeline use Clingo to complete blank cells at inference; their CSR of 100% reflects solver post-processing, not a learned constraint mechanism. Board-Acc of 0% indicates the solver produces a valid but incorrect completion. † PBCS uses a pipeline architecture with CP-SAT at inference and does not publish Sym-Acc, CSR, or VCSR; its Board-Acc of 99.4% is taken from Mulamba et al. (2024). ‡ AS2 CSR and VCSR are measured after greedy constrained decoding (Section 3.7); no external solver is invoked during inference. "Raw" Board-Acc (argmax only, no post-processing) is 95.60%.

| Method | Sym-Acc ↑ | Board-Acc ↑ | CSR ↑ | VCSR ↑ |
|---|---|---|---|---|
| Perception-Only § | 60.51 | 0.00 | 100.00 | N/A |
| Pipeline § | 60.51 | 0.00 | 100.00 | N/A |
| PBCS † (Mulamba et al., 2024) | N/A | 99.4 | N/A | N/A |
| AS2 (ours) ‡ | **99.89** | **100.0** | **100.0** | **100.0** |

Visual Sudoku. Table 3 presents results on our primary benchmark. AS2 achieves a cell accuracy (Sym-Acc) of 99.89%, reflecting both 100% accuracy on clue cells (guaranteed by hard clue restoration) and near-perfect accuracy on blank cells, where the Transformer must infer the correct digit from constraint propagation alone. The unaided board accuracy (raw argmax with no post-processing) is 95.60%, reflecting residual constraint-satisfaction failures concentrated in blank cells where the Transformer has not fully converged to a valid fixed point. Iterative $T_P$ refinement with $K = 10$ steps (Eq. 6) raises the constraint satisfaction rate to 98.7%, demonstrating that the trained distributions already encode sufficient constraint information to recover approximately 68% of the argmax failures through soft propagation alone. Greedy constrained decoding (Section 3.7) closes the remaining gap entirely, achieving a CSR of 100.0% and a Clingo-verified VCSR of 100.0% across all 1,000 test boards. Neither procedure calls an external solver.
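For illustration, the confidence-ordered greedy decoding evaluated above can be sketched on a generic exactly-once problem as follows. The function and helper names are ours, and this is a toy sketch, not the paper's implementation, which operates on the 81-cell grid with its 27 groups.

```python
import numpy as np

def greedy_constrained_decode(p, groups, evidence):
    """Confidence-ordered greedy decoding (cf. Section 3.7), sketched.

    p        : (n, s) per-position symbol distributions
    groups   : list of position-index lists (constraint groups)
    evidence : position indices whose argmax is locked first
    """
    p = p.copy()
    n = p.shape[0]
    # Peers of i: positions sharing at least one constraint group with i.
    peers = {i: set() for i in range(n)}
    for g in groups:
        for i in g:
            peers[i].update(j for j in g if j != i)

    assign = {}

    def commit(i, v):
        assign[i] = v
        for j in peers[i]:
            if j not in assign:
                p[j, v] = 0.0                  # hard-zero the used symbol
                z = p[j].sum()
                if z > 0:
                    p[j] /= z                  # renormalise peer distribution

    for i in evidence:                         # lock evidence cells first
        commit(i, int(np.argmax(p[i])))

    while len(assign) < n:                     # most confident latent cell next
        latent = [i for i in range(n) if i not in assign]
        i = max(latent, key=lambda k: p[k].max())
        commit(i, int(np.argmax(p[i])))
    return [assign[i] for i in range(n)]
```

On a 2×2 mini Latin square (positions 0–3, symbols {0, 1}, row and column groups), locking one evidence cell and propagating the hard-zero/renormalise step is already enough to force the unique consistent completion.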
Among prior work, PBCS (Mulamba et al., 2024) reports the highest published board accuracy of 99.4% on Visual Sudoku by coupling a perception module with a full CP-SAT solver at inference time. However, it does not report constraint satisfaction metrics and requires a complete constraint model at test time. SATNet (Wang et al., 2019), NeurASP (Yang et al., 2020), and RRN (Palm et al., 2018) have each been evaluated on related Sudoku tasks, but direct comparison is complicated by differences in data splits and evaluation protocols. AS2 achieves 100% board accuracy and 100% constraint satisfaction without any external solver, surpassing all previously reported results.

Analysis. The Visual Sudoku results demonstrate the central claim of this paper: a soft differentiable approximation of the ASP immediate-consequence operator, combined with constraint-aware attention (via constraint-group membership embeddings) and greedy constrained decoding, can achieve perfect constraint satisfaction on a non-trivial combinatorial reasoning task without invoking an external solver. The algebraic CSR and the Clingo-verified VCSR agree exactly, validating the correctness of the algebraic constraint evaluator. The MNIST Addition results confirm that AS2 scales to arithmetic constraints and achieves competitive results across $N \in \{2, 4, 8\}$ addends. The slight degradation in sum accuracy at $N = 8$ relative to the baselines (which do not use a constraint loss) is an expected consequence of the additional optimization objective, and does not affect the architecture's utility on problems where constraints genuinely matter, as Visual Sudoku demonstrates.

5 Conclusion

We introduced AS2 (Attention-Based Soft Answer Sets), a fully differentiable neuro-symbolic architecture that replaces discrete Answer Set Programming (ASP) solvers with a soft probabilistic lift of the immediate-consequence operator $T_P$. Three design principles distinguish AS2 from prior neuro-symbolic systems.
First, the architecture is entirely soft: it maintains per-position probability distributions over the symbol domain throughout the forward pass and enforces constraints via a differentiable fixed-point residual loss, enabling end-to-end gradient flow from constraint satisfaction into the perception encoder. Second, the architecture is free of conventional positional embeddings; problem structure is encoded exclusively through constraint-group membership embeddings derived from the declarative ASP specification. Third, the model approximates a soft differentiable version of the classical fixpoint operator of logic programming, providing a principled surrogate for ASP inference that avoids the degenerate minima of naive penalty formulations.

A natural concern about any fully differentiable neuro-symbolic architecture is whether it retains genuine symbolic character once discrete inference is replaced by continuous computation. We argue that the relevant criterion is not whether gradients are used, but what structure governs the reasoning computation. In a generic neural network, the reasoning transformation is parameterized and learned entirely from data, with no prior commitment to its form. In AS2, by contrast, the reasoning layer is a structural translation of the ASP immediate-consequence operator $T_P$, whose topology is fixed by the declarative logic program and whose form is not adjusted during training. Symbolic structure therefore survives relaxation in AS2, because the governing computation encodes the constraint semantics of the program rather than approximating them empirically.

References

S. Badreddine, A. d. Garcez, L. Serafini, and M. Spranger (2022). Logic tensor networks. Artificial Intelligence 303, 103649.

G. Brewka, T. Eiter, and M. Truszczyński (2011). Answer set programming at a glance. Communications of the ACM 54(12), 92–103.
A. S. d. Garcez, M. Gori, L. C. Lamb, L. Serafini, M. Spranger, and S. N. Tran (2019). Neural-symbolic computing: an effective methodology for principled integration of machine learning and reasoning. FLAP 6(4), 611–632.

M. Gebser, R. Kaminski, B. Kaufmann, and T. Schaub (2014). Clingo = ASP + control: preliminary report. In Technical Communications of the Thirtieth International Conference on Logic Programming.

M. Gelfond and V. Lifschitz (1988). The stable model semantics for logic programming. In Proceedings of the 5th International Conference and Symposium on Logic Programming, 1070–1080.

P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang (2020). Concept bottleneck models. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), PMLR 119, 5338–5348.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324.

Z. Li, J. Huang, and M. Naik (2023). Scallop: a language for neurosymbolic programming. Proc. ACM Program. Lang. 7(PLDI), 1463–1487.

R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, and L. De Raedt (2018). DeepProbLog: neural probabilistic logic programming. In Advances in Neural Information Processing Systems 31, 3753–3763.

R. Manhaeve, S. Dumančić, A. Kimmig, T. Demeester, and L. De Raedt (2021a). Neural probabilistic logic programming in DeepProbLog. Artificial Intelligence 298, 103504.

R. Manhaeve, G. Marra, and L. De Raedt (2021b). Approximate inference for neural probabilistic logic programming. In Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning, 475–486.

S. McLeish, A. Bansal, A. Stein, N. Jain, J. Kirchenbauer, B. R. Bartoldson, B. Kailkhura, A. Bhatele, J. Geiping, A. Schwarzschild, and T. Goldstein (2024). Transformers can do arithmetic with the right embeddings. In Advances in Neural Information Processing Systems 37.

M. Mulamba, J. Mandi, A. I. Mahmutoğulları, and T. Guns (2024). Perception-based constraint solving for Sudoku images. Constraints 29(1–2), 112–151.

R. Palm, U. Paquet, and O. Winther (2018). Recurrent relational networks. In Advances in Neural Information Processing Systems 31, 3372–3382.

A. Skryagin, W. Stammer, D. S. Dhami, and K. Kersting (2022). SLASH: embracing probabilistic circuits into neural answer set programming. In Proceedings of the Workshop on Tractable Probabilistic Modeling at UAI.

A. Takemura and K. Inoue (2024). Differentiable logic programming for distant supervision. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI), FAIA 392, 1301–1308.

M. H. van Emden and R. A. Kowalski (1976). The semantics of predicate logic as a programming language. Journal of the ACM 23(4), 733–742.

E. van Krieken, T. Thanapalasingam, J. M. Tomczak, F. van Harmelen, and A. ten Teije (2023). A-NeSI: a scalable approximate method for probabilistic neurosymbolic inference. In Advances in Neural Information Processing Systems 36.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, 5998–6008.

P. Wang, P. Donti, B. Wilder, and Z. Kolter (2019). SATNet: bridging deep learning and logical reasoning using a differentiable satisfiability solver. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), PMLR 97, 6545–6554.

Z. Yang, A. Ishay, and J. Lee (2020). NeurASP: embracing neural networks into answer set programming. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 1755–1762.