
Paper deep dive

A Synthesizable RTL Implementation of Predictive Coding Networks

Timothy Oh

Year: 2026 · Venue: arXiv preprint · Area: cs.NE · Type: Preprint

Abstract

Abstract:Backpropagation has enabled modern deep learning but is difficult to realize as an online, fully distributed hardware learning system due to global error propagation, phase separation, and heavy reliance on centralized memory. Predictive coding offers an alternative in which inference and learning arise from local prediction-error dynamics between adjacent layers. This paper presents a digital architecture that implements a discrete-time predictive coding update directly in hardware. Each neural core maintains its own activity, prediction error, and synaptic weights, and communicates only with adjacent layers through hardwired connections. Supervised learning and inference are supported via a uniform per-neuron clamping primitive that enforces boundary conditions while leaving the internal update schedule unchanged. The design is a deterministic, synthesizable RTL substrate built around a sequential MAC datapath and a fixed finite-state schedule. Rather than executing a task-specific instruction sequence inside the learning substrate, the system evolves under fixed local update rules, with task structure imposed through connectivity, parameters, and boundary conditions. The contribution of this work is not a new learning rule, but a complete synthesizable digital substrate that executes predictive-coding learning dynamics directly in hardware.

Tags

ai-safety (imported, 100%) · csne (suggested, 92%) · preprint (suggested, 88%)



Full Text



Timothy Oh, University of California, Riverside (August 2025)

Artifact availability. A complete reference implementation of the architecture described in this paper is available as open-source RTL, together with simulation testbenches and reproducibility scripts, at https://github.com/alskaf1293/neuralcomputer. The repository contains synthesizable SystemVerilog implementations of the neural core, layer composition modules, and multi-layer network construction, along with Verilator-based testbenches that reproduce the experiments reported in Section 8.
Each experiment produces CSV learning curves that can be plotted directly to regenerate the figures in this paper.

1 Introduction

Modern machine learning systems are typically trained by backpropagation, which computes gradients by combining global loss information with a tightly coordinated forward/backward computation schedule. Although highly effective, this paradigm is difficult to realize as a fully distributed learning substrate in hardware. The backward pass requires structured global error propagation, intermediate activation storage, and substantial movement of data through memory and interconnect.

Predictive coding offers an alternative formulation in which inference and learning arise from minimizing prediction errors across a hierarchy (Rao and Ballard, 1999; Friston, 2005). In standard predictive coding networks (PCNs), each layer predicts the layer below; each unit updates its state and synaptic weights using only locally available quantities: its own activity, its own prediction error, presynaptic activity from the adjacent layer above, and prediction errors from the adjacent layer below. This locality makes predictive coding attractive as a candidate algorithmic substrate for physically embedded learning systems.

This paper presents a digital micro-architecture that directly implements predictive coding equations at the level of individual neurons. Each neural core corresponds to a single scalar unit and executes a fixed finite-state schedule per tick. Communication is strictly between adjacent layers via hardwired connections. No shared parameter memory and no global learning-phase controller are required. The objective of this work is not to propose a new learning rule, but to demonstrate a concrete mapping from predictive-coding-style local learning to a structured, synthesizable digital substrate.

Contributions.

• A composable neural-core architecture that implements a discrete-time predictive coding update using a sequential MAC datapath.
• A uniform per-neuron clamping interface that supports supervised training and inference through boundary conditions.
• A deterministic, synthesizable RTL organization based on IEEE-754 arithmetic.
• A direct correspondence between predictive coding computations and hardware FSM stages.

2 Related Work

Neuromorphic and local-learning hardware. A substantial body of work has targeted brain-inspired hardware using spiking neural networks (SNNs). Platforms such as Intel's Loihi, IBM's TrueNorth, SpiNNaker, and BrainScaleS implement on-chip local learning through spike-timing-dependent plasticity (STDP) and related Hebbian rules (Davies et al., 2018; Furber et al., 2014; Schmitt et al., 2017). These systems achieve impressive energy efficiency by exploiting event-driven, asynchronous computation with sparse binary spike communication.

The present work occupies a different point in this design space along three axes. First, the neural representation is continuous-valued rather than spike-based: each core maintains a scalar floating-point state that evolves continuously under local gradient dynamics, not an integrate-and-fire membrane potential driven by discrete events. Second, the learning rule is a Hebbian prediction-error update derived from a quadratic energy function, not STDP; the weight update depends on the postsynaptic prediction error and presynaptic activity, not on spike-timing coincidence. Third, the substrate is a synchronous deterministic RTL design targeting standard digital synthesis flows, rather than an asynchronous mesh or analog mixed-signal fabric. These choices prioritize a direct, verifiable correspondence between the predictive-coding update equations and the hardware datapath, at the cost of the energy-efficiency gains that event-driven spiking systems provide. Whether predictive-coding dynamics and spiking event-driven computation can be usefully combined is an open question for future work.

Predictive coding as hierarchical inference.
Predictive coding has been developed as a functional account of cortical computation in which perception and learning arise from minimizing prediction error in hierarchical generative models (Rao and Ballard, 1999; Friston, 2005). The free energy principle provides a broader interpretive framework in which systems maintain themselves by minimizing variational free energy, motivating predictive coding as a concrete algorithmic instantiation (Friston, 2009).

Relationship to backpropagation. Predictive coding has been analyzed as a potential route to backpropagation-like learning rules under specific modeling assumptions. In particular, Whittington and Bogacz (2017) show that in a predictive coding network with local Hebbian plasticity, converged inference under clamped outputs yields error signals that approximate backpropagation gradients. This connection motivates predictive-coding implementations as candidates for local learning substrates, while also clarifying the conditions under which equivalence to backpropagation is expected.

Incremental predictive coding and stability. The present architecture operates in a tick-based incremental regime rather than assuming full convergence of inference between parameter updates. Salvatori et al. (2023) propose a stable and fully automatic learning algorithm for PCNs and analyze incremental predictive coding variants. Related analyses connect incremental variants to incremental Expectation-Maximization (EM) viewpoints (Neal and Hinton, 1998) and provide convergence results for incremental EM-style procedures under appropriate conditions (Karimi et al., 2019). These works motivate incremental schedules, but do not directly guarantee stability for a specific discretization and finite-precision hardware datapath.

Backpropagation as a biological and hardware mismatch.
Concerns about biological plausibility and mechanistic mismatch between backpropagation and neural systems are summarized by Lillicrap et al. (2020), building on the canonical backpropagation formulation introduced by Rumelhart et al. (1986). These critiques motivate alternatives that reduce global coordination and enforce locality.

3 Backpropagation and Its Constraints

Backpropagation requires global coordination that is challenging to reconcile with distributed biological learning and with certain classes of embedded hardware systems (Lillicrap et al., 2020). First, standard gradient computation requires propagating error information backward through the entire network, creating a dependency structure that is not purely local. Second, training is typically organized into distinct phases (forward, backward, update) that demand synchronization and storage of intermediate activations. Third, backpropagation assumes differentiability of the computational graph, whereas biological systems involve discontinuous and stochastic signaling (Rumelhart et al., 1986). While these issues do not prevent backpropagation from being implemented on conventional accelerators, they motivate research into alternative learning formulations that admit local update structure suitable for embedded online adaptation.

4 Predictive Coding

Predictive coding frames inference and learning as minimization of prediction errors across a hierarchical network (Rao and Ballard, 1999; Friston, 2005). Let $x^{(\ell)} \in \mathbb{R}^{n_\ell}$ denote the activity vector at layer $\ell$, and let $\Theta^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell+1}}$ denote the synaptic weight matrix projecting from layer $\ell+1$ to layer $\ell$. The top-down prediction generated for layer $\ell$ is

$$\mu^{(\ell)} = \Theta^{(\ell)} f\big(x^{(\ell+1)}\big),$$

where $f(\cdot)$ is an element-wise activation function. The prediction error is

$$\varepsilon^{(\ell)} = x^{(\ell)} - \mu^{(\ell)}.$$
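As a concrete illustration (not taken from the paper; the dimensions and the tanh activation are arbitrary choices for this sketch), the top-down prediction and prediction error for one layer can be written in a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, n_lp1 = 3, 4                                # illustrative layer sizes
Theta = 0.1 * rng.standard_normal((n_l, n_lp1))  # weights from layer l+1 to layer l
x_l = rng.standard_normal(n_l)                   # activity at layer l
x_lp1 = rng.standard_normal(n_lp1)               # activity at layer l+1

f = np.tanh                                      # example element-wise activation
mu_l = Theta @ f(x_lp1)                          # top-down prediction mu^(l)
eps_l = x_l - mu_l                               # prediction error eps^(l)
```

By construction, the activity at layer $\ell$ decomposes exactly into its top-down prediction plus the local error.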
A common quadratic prediction-error objective is

$$E = \sum_\ell \big\| \varepsilon^{(\ell)} \big\|^2 = \sum_\ell \big\| x^{(\ell)} - \Theta^{(\ell)} f\big(x^{(\ell+1)}\big) \big\|^2.$$

Although $E$ is global, its gradients decompose into strictly local terms.

4.1 Activity dynamics and synaptic updates

Inference corresponds to gradient descent on $E$ with respect to activities. For an internal layer $\ell$,

$$\Delta x^{(\ell)} = \gamma \Big( -\varepsilon^{(\ell)} + f'\big(x^{(\ell)}\big) \odot \big(\Theta^{(\ell-1)}\big)^\top \varepsilon^{(\ell-1)} \Big),$$

where $\odot$ is element-wise multiplication and $\gamma$ is an activity step size. Synaptic weights can be updated using a local Hebbian-like rule,

$$\Delta \Theta^{(\ell)} = \alpha\, \varepsilon^{(\ell)} \big( f\big(x^{(\ell+1)}\big) \big)^\top,$$

with learning-rate scale $\alpha$. Each synapse depends only on its associated presynaptic activity and postsynaptic error.

4.2 Component-wise form

For neuron $i$ in layer $\ell$,

$$\mu^{(\ell)}_i = \sum_j \Theta^{(\ell)}_{ij} f\big(x^{(\ell+1)}_j\big), \qquad \varepsilon^{(\ell)}_i = x^{(\ell)}_i - \mu^{(\ell)}_i,$$

$$b^{(\ell)}_i = \sum_k \Theta^{(\ell-1)}_{ki}\, \varepsilon^{(\ell-1)}_k,$$

$$\Delta x^{(\ell)}_i = \gamma \big( -\varepsilon^{(\ell)}_i + f'\big(x^{(\ell)}_i\big)\, b^{(\ell)}_i \big),$$

$$\Delta \Theta^{(\ell)}_{ij} = \alpha\, \varepsilon^{(\ell)}_i\, f\big(x^{(\ell+1)}_j\big).$$

These equations make locality explicit: each neuron depends only on its own scalar state, its own local prediction error, presynaptic state from the adjacent layer above, and backpropagated error products from the adjacent layer below. In the RTL implementation, raw state values $x_j$ are communicated between layers. The activation $f(\cdot)$ is applied locally inside the receiving neuron when presynaptic inputs are consumed for prediction and weight update.

4.3 Discrete-time update implemented in hardware

The architecture implements a tick-based discrete-time variant.
For neuron $i$ in layer $\ell$, presynaptic indices $j \in \{0, \dots, N-1\}$, and an explicit bias lane $j = N$ with feature value $1$, define the effective local state used during the current tick as

$$x^{(\ell)}_{i,\mathrm{eff}} = \begin{cases} x^{(\ell)}_{\mathrm{obs},i} & \text{if external clamping is asserted for neuron } i, \\ x^{(\ell)}_i & \text{otherwise.} \end{cases}$$

The hardware then performs

$$\mu^{(\ell)}_i = \sum_{j=0}^{N} \theta^{(\ell)}_{ij}\, f\big(x^{(\ell+1)}_j\big), \tag{1}$$

$$\varepsilon^{(\ell)}_i = x^{(\ell)}_{i,\mathrm{eff}} - \mu^{(\ell)}_i, \tag{2}$$

$$b^{(\ell)}_i = \sum_k \theta^{(\ell-1)}_{ki}\, \varepsilon^{(\ell-1)}_k, \tag{3}$$

$$x^{(\ell)}_i \leftarrow x^{(\ell)}_i + \gamma \big( f'\big(x^{(\ell)}_{i,\mathrm{eff}}\big)\, b^{(\ell)}_i - \varepsilon^{(\ell)}_i \big), \tag{4}$$

$$\theta^{(\ell)}_{ij} \leftarrow \theta^{(\ell)}_{ij} + \alpha\, \varepsilon^{(\ell)}_i\, f\big(x^{(\ell+1)}_j\big). \tag{5}$$

Each tick performs one explicit Euler-style state update together with one local synaptic update. In the implementation, the presynaptic nonlinearity is applied when the received state from the adjacent upper layer is consumed, rather than being stored as the communicated inter-layer quantity.

5 A Digital Predictive Coding Substrate

At the heart of the system is a hardware unit called a neural core that executes the local computations required by predictive coding. Each core corresponds to one indexed unit $(i, \ell)$, maintains its local state and parameters, and communicates only with adjacent layers via hardwired signals.

5.1 Neural core schematic

[Figure: schematic of a single neural core holding state $x^{(\ell)}_i$, with weights $\theta^{(\ell)}_{ij}$ connecting it to presynaptic units $x^{(\ell+1)}_j$ in layer $\ell+1$ and weights $\theta^{(\ell-1)}_{ki}$ connecting units $x^{(\ell-1)}_k$ in layer $\ell-1$ back to it.]

5.2 Neural core micro-architecture

Each neural core contains:

1. Local state and parameter storage.
The core stores its scalar state $x^{(\ell)}_i$, prediction error $\varepsilon^{(\ell)}_i$, and a local vector of synaptic weights $\theta^{(\ell)}_{ij}$, including an explicit bias lane implemented as an additional fixed presynaptic channel. The learning rate $\alpha$ and activity step size $\gamma$ are supplied externally but applied locally, with no shared parameter memory.

2. Presynaptic activation and gating logic. Presynaptic inputs are communicated between layers as raw state values $x^{(\ell+1)}_j$. Within the receiving neuron, these values are transformed by an activation $f(\cdot)$ selected per layer when forming predictions and weight updates. The derivative $f'(x)$ is computed locally from the neuron's effective state used during the current tick and gates the bottom-up term during the state update. The bias lane is treated as a constant unit feature.

3. Sequential DSP datapath. A single MAC datapath performs prediction accumulation, weight updates, and construction of backpropagated products by iterating over indices, trading throughput for reduced area.

4. Local FSM scheduler. Each core sequences computation through PRED → ERR → BACKSUM → BACKVEC → WUP → STATE. The schedule is identical during inference and learning; whether weights update and whether state values are clamped are controlled externally.

5. Hardwired neighbor interfaces. Each core communicates only:
• its current state $x^{(\ell)}_i$ downward to layer $\ell-1$,
• products used to form the bottom-up term upward to layer $\ell+1$.
Communication is point-to-point with no global arbitration.

5.3 FSM stages

Each tick executes one complete update:

• PRED: compute $\mu^{(\ell)}_i = \sum_j \theta^{(\ell)}_{ij} f\big(x^{(\ell+1)}_j\big)$, including the explicit bias lane.
• ERR: compute and store $\varepsilon^{(\ell)}_i = x^{(\ell)}_{i,\mathrm{eff}} - \mu^{(\ell)}_i$.
• BACKSUM: accumulate $b^{(\ell)}_i = \sum_k \theta^{(\ell-1)}_{ki}\, \varepsilon^{(\ell-1)}_k$ from lower-layer signals.
• BACKVEC: emit products $\theta^{(\ell)}_{ij}\, \varepsilon^{(\ell)}_i$ for all non-bias presynaptic indices $j$, so that the adjacent upper layer can form its bottom-up term.
• WUP: update weights $\theta^{(\ell)}_{ij} \leftarrow \theta^{(\ell)}_{ij} + \alpha\, \varepsilon^{(\ell)}_i\, f\big(x^{(\ell+1)}_j\big)$. In inference-only operation, this stage still executes structurally, but becomes a numerical no-op when $\alpha = 0$.
• STATE: update state using $x^{(\ell)}_i \leftarrow x^{(\ell)}_i + \gamma \big( f'\big(x^{(\ell)}_{i,\mathrm{eff}}\big)\, b^{(\ell)}_i - \varepsilon^{(\ell)}_i \big)$, unless hard clamping overwrites the stored state at the end of the tick.

5.4 Network-level clamping for supervised learning and inference

Each core exposes:

• x_set_en: enables an externally supplied observation for the current tick,
• x_obs: observed state value.

The RTL defines an effective state

$$x^{(\ell)}_{i,\mathrm{eff}} = \begin{cases} x^{(\ell)}_{\mathrm{obs},i} & \text{if } \texttt{x\_set\_en}_i = 1, \\ x^{(\ell)}_i & \text{otherwise,} \end{cases}$$

which is used during the tick when computing the local prediction error and the local derivative gate. The stored state register is then updated at the end of the tick as

$$x^{(\ell)}_i \leftarrow \begin{cases} x^{(\ell)}_{\mathrm{obs},i} & \text{if } \texttt{CLAMP\_HARD} = 1 \wedge \texttt{x\_set\_en}_i = 1, \\ x^{(\ell)}_i + \gamma \big( f'\big(x^{(\ell)}_{i,\mathrm{eff}}\big)\, b^{(\ell)}_i - \varepsilon^{(\ell)}_i \big) & \text{otherwise.} \end{cases}$$

Thus, clamping affects not only the final stored state, but also the computation performed during the tick whenever an external observation is present. Supervised learning clamps boundary layers such as input and target output layers, while inference clamps only the input and reads the free output after a fixed tick budget.

5.5 Tick-based execution model

Computation proceeds in discrete ticks controlled by an external clock. Each tick triggers a full execution of the neural-core finite-state schedule.
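To make the per-tick schedule concrete, here is a behavioral sketch of one tick for a single neuron in plain Python, mirroring the PRED → ERR → BACKSUM → BACKVEC → WUP → STATE stages and the clamping semantics described above. This is an illustrative model, not the RTL itself: the names and function signature are ours, and it uses float64 arithmetic where the hardware uses IEEE-754 single precision.

```python
import numpy as np

def neuron_tick(theta, x_up, back_w, back_err, x, f, fprime,
                alpha, gamma, x_obs=None, clamp_hard=False):
    """One tick of the local predictive-coding update for a single neuron.

    theta    : weights to layer l+1, length N+1 (last lane is the bias)
    x_up     : presynaptic states from layer l+1, length N
    back_w   : weights theta_ki^(l-1) from the layer below, length M
    back_err : errors eps_k^(l-1) from the layer below, length M
    """
    x_eff = x_obs if x_obs is not None else x        # effective state this tick
    feat = np.append(f(x_up), 1.0)                   # activations plus bias lane
    mu = theta @ feat                                # PRED
    eps = x_eff - mu                                 # ERR
    b = back_w @ back_err                            # BACKSUM
    back_vec = theta[:-1] * eps                      # BACKVEC (non-bias lanes only)
    theta = theta + alpha * eps * feat               # WUP (numerical no-op if alpha=0)
    if clamp_hard and x_obs is not None:             # STATE
        x_new = x_obs                                # hard clamp overwrites the register
    else:
        x_new = x + gamma * (fprime(x_eff) * b - eps)
    return theta, x_new, eps, back_vec

# Example: one inference-mode tick (alpha = 0, so weights stay fixed)
theta0 = np.array([0.5, -0.25, 0.1])                 # N = 2 presynaptic lanes + bias
theta1, x1, eps, back_vec = neuron_tick(
    theta0, x_up=np.array([0.2, -0.4]),
    back_w=np.array([0.3]), back_err=np.array([0.05]),
    x=0.1, f=np.tanh, fprime=lambda v: 1.0 - np.tanh(v) ** 2,
    alpha=0.0, gamma=0.1)
```

Note that the stage order matters: the weight update (WUP) uses the error computed in ERR from the pre-update state, matching the fixed FSM schedule rather than a fully converged inference loop.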
At the network level, a global start_tick request is converted into an internal start pulse once the network is idle, and this pulse is then broadcast to all layer modules and neural cores. Each core asserts a done signal upon completion of its local update. Layer modules latch per-core completion and emit a one-shot layer-level done pulse once all neurons in the layer have completed. The network module applies the same aggregation strategy across layers and emits a one-shot network-level done pulse once all layers have completed the current tick. This organization yields a deterministic tick schedule without requiring any asynchronous coordination inside the neural cores themselves. From a dynamical-systems perspective, each tick corresponds to one explicit Euler-style step of the predictive-coding dynamics described in Section 4.3.

6 Implementation Details

6.1 Arithmetic format

All arithmetic is performed in IEEE-754 single precision using the HardFloat recoded format (recFN). The rounding mode is fixed to round-to-nearest-even. Multiply, add, and fused multiply-add units are reused sequentially across indices.

6.2 Sequential datapath

Each core iterates across presynaptic indices during the prediction and weight-update stages. As a result, the cycle count per tick scales linearly with fan-in. This design choice trades throughput for reduced area and a uniform per-core implementation.

6.3 Bias lane

A bias lane is implemented as an additional presynaptic feature with constant value $1$. The bias weight can optionally use an independent learning-rate scale or be frozen.

6.4 Hardware cost model

Because each neural core reuses a sequential multiply–accumulate datapath, the cycle count per tick scales linearly with fan-in. Let $N$ denote the number of true presynaptic inputs and $M$ the number of backpropagated error inputs.
The implemented FSM executes:

• $N + 1$ cycles for prediction accumulation, including the explicit bias lane,
• $1$ cycle for error formation,
• $M$ cycles for accumulation of lower-layer back inputs,
• $N$ cycles to emit the upward back vector,
• $N + 1$ cycles for weight updates, again including the bias lane,
• $1$ cycle for the state update.

Accordingly, the per-tick cycle count is approximately

$$C_{\mathrm{tick}} = (N+1) + 1 + M + N + (N+1) + 1 = 3N + M + 4.$$

Thus the design has linear tick latency in both presynaptic fan-in and incoming back-error fan-in. This sequential organization reduces area per neuron but increases latency per tick. Higher-throughput implementations could introduce parallel MAC units or vectorized datapaths.

7 Stability and Convergence Considerations

Predictive coding is often presented as gradient descent on a prediction-error energy in continuous time (Friston, 2005). The hardware described here implements a discrete-time variant with finite precision, a fixed per-tick computation schedule, optional clamping, and simultaneous activity and weight updates. These departures matter: stability properties of idealized continuous-time dynamics do not automatically transfer to a tick-based explicit Euler discretization in floating-point arithmetic. Accordingly, we treat stability in this work as an empirical property to be characterized as a function of step sizes, initialization, activation choice, and tick budget.

7.1 Relation to backpropagation and incremental predictive coding

Predictive coding can approximate backpropagation-like error signals under converged inference and standard modeling assumptions (Whittington and Bogacz, 2017). The present system does not assume inference convergence between parameter updates, and instead operates in an incremental regime with a fixed number of ticks per example. Recent analyses of incremental predictive coding motivate such schedules and connect them to incremental EM viewpoints (Salvatori et al., 2023; Neal and Hinton, 1998; Karimi et al., 2019), but do not directly guarantee stability for a particular hardware discretization.

8 Experiments

Experiments are implemented as Verilator simulations using the reference RTL implementation provided in the public repository. Learning and inference are controlled entirely through clamping and learning-rate parameters, without changing the internal neural-core schedule. All plots use a logarithmic vertical axis.

8.1 Teacher–student regression

A three-layer network (2→4→3) with a ReLU hidden layer is trained to match a fixed teacher $y = A_{\mathrm{gt}}\,\mathrm{ReLU}(B_{\mathrm{gt}} x)$. Input and output layers are clamped; the hidden layer is free. MSE begins at 0.341207, spikes briefly to 0.369199 at epoch 1, then descends rapidly to 0.004784 by epoch 7 before settling to 0.005935 at epoch 25 (Figure 1). The two-phase profile, a sharp initial descent followed by a slow plateau, is consistent with the incremental tick regime, in which inference does not fully converge between weight updates.

Figure 1: Training curve for the 2→4→3 ReLU network. After a brief transient at epoch 1, MSE descends rapidly, then settles into a slow-improvement plateau.

8.2 Nonlinear regression with hidden tanh layer

A smaller network (2→2→1) with a tanh hidden layer is trained on targets $y = A_{\mathrm{gt}} \tanh(B_{\mathrm{gt}} x + b_1) + b_2$. MSE begins at 1.106512, spikes to 1.986610 at epoch 1, then collapses to 0.004725 by epoch 3, a reduction of more than two orders of magnitude in two epochs. Subsequent improvement is slow and monotone, reaching 0.004382 at epoch 25 (Figure 2).

Figure 2: Training curve for the 2→2→1 tanh network. MSE drops more than two orders of magnitude by epoch 3 and subsequently improves slowly to a small residual.
8.3 Architectural scaling

To test whether the local update dynamics generalize across network sizes, the same tick schedule, clamping interface, and training protocol are applied to three architectures using a single parameterized testbench (tb_scale_function.sv): 2→4→3, 4→8→4, and 8→16→8. No changes are made to the neural-core FSM or clamping logic; only the compile-time dimension parameters differ.

All three networks exhibit rapid initial descent (74%, 67%, and 88% MSE reduction by epoch 1) followed by a stable residual floor that rises modestly with dimension (Figure 3). Larger networks settle at higher residuals, consistent with the incremental regime: wider hidden layers require more ticks per sample to reach equivalent inference equilibration.

The key result is that no RTL modifications are needed across scales. Architectural dimension is a structural parameter of the network instantiation; the per-neuron FSM and learning rule are unchanged. Larger networks are realized by instantiating more neural cores, not by redesigning the substrate.

Figure 3: Training curves for three architectures of increasing dimension. All exhibit rapid initial descent followed by a stable residual floor that rises modestly with network size.

8.4 Quantitative summary

Experiment               Architecture   Initial MSE   Final MSE
Teacher–student (ReLU)   2→4→3          0.341207      0.005935
Hidden tanh              2→2→1          1.106512      0.004382
Scaling: small           2→4→3          0.229206      0.006053
Scaling: medium          4→8→4          0.248934      0.010317
Scaling: large           8→16→8         0.273252      0.017752

Table 1: Final MSE values reported at epoch 25. The top two rows are the reference regression experiments; the bottom three are the architectural scaling sweep.

Across all five runs, end-to-end supervised learning proceeds using only local state variables, local prediction errors, and adjacent-layer communication under a fixed FSM schedule.
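The relative improvements implied by Table 1 follow directly from the reported values (the percentages below are derived from the table, not separately reported in the paper):

```python
# (initial MSE, final MSE) pairs taken from Table 1
runs = {
    "teacher-student (ReLU)": (0.341207, 0.005935),
    "hidden tanh":            (1.106512, 0.004382),
    "scaling: small":         (0.229206, 0.006053),
    "scaling: medium":        (0.248934, 0.010317),
    "scaling: large":         (0.273252, 0.017752),
}
reductions = {name: 1.0 - final / initial
              for name, (initial, final) in runs.items()}
for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% total MSE reduction by epoch 25")
```

The computed reductions also make the scaling trend explicit: the total reduction shrinks monotonically from the small to the large architecture, consistent with the residual floor rising with dimension.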
Nonzero residual floors in all runs motivate further study of tick budget and step-size tuning.

8.5 Evaluation protocol

Learning proceeds in alternating phases: an inference phase ($\alpha = 0$, $\gamma > 0$) in which the network state settles under clamped boundary conditions, and a learning phase ($\alpha > 0$, $\gamma > 0$) in which weights are updated while the state continues evolving. The distinction is imposed entirely through boundary conditions and learning-rate parameters, with no change to the internal hardware schedule.

9 Discussion: Toward Embedded Adaptive Computation

The proposed system is motivated by a practical hardware question: what changes when learning is implemented as a local dynamical process rather than as a procedure that alternates between forward evaluation, backward gradient propagation, and parameter updates stored in centralized memory? Predictive coding is attractive in this context because its update rules decompose into strictly local terms, allowing each unit to update its state and synapses using only adjacent-layer signals. The resulting substrate resembles a distributed physical process more than a conventional program executed by a centralized controller.

This perspective shifts where task structure is expressed in the system. In a conventional von Neumann setting, task behavior is typically specified primarily through externally authored software executed on a fixed substrate. In the present architecture, behavior is shaped more directly by the interaction between inputs, local state dynamics, local plasticity, and externally imposed boundary conditions.

9.1 Relationship to Mortal Computation

Hinton (2022) introduces the concept of mortal computation to describe systems in which algorithm and hardware are inseparable: because each physical instance has unknown analog variations, the trained parameters are useful only for that particular piece of hardware and die with it.
The present work sits closer to this end of the design space than to conventional immortal computation. The predictive-coding update rules are not software executing on a general-purpose substrate; they are directly instantiated as the hardware datapath itself. The FSM schedule, the MAC accumulation, the local state registers, and the weight-update logic are the process; there is no separation between hardware and software.

This framing is descriptive rather than foundational. Its value is to highlight how a fixed local dynamical substrate can support both inference and adaptation without separate programmed learning phases, a structural property it shares with the mortal-computation perspective, even though the present design uses synchronous digital logic rather than analog circuits. Whether extending this further toward true analog mortal computation, accepting hardware-specific parameter fidelity in exchange for energy efficiency, would benefit predictive-coding substrates is an open question for future work.

9.2 Relationship to Moravec's Paradox

Moravec's Paradox highlights an asymmetry between symbolic manipulation, which maps cleanly onto sequential digital computation, and sensorimotor inference, which is robust in biological systems but challenging for conventional architectures (Moravec, 1988). The present work does not claim to resolve this paradox. Rather, it provides a concrete substrate for testing a related hypothesis: some aspects of perception and control may be better realized as fast local inference in distributed dynamical systems than as explicit symbolic computation mediated by global memory. Because the proposed architecture performs continual local error minimization under fixed update rules, it may be well suited to tasks naturally expressed as inference under constraints, particularly in noisy, streaming, or partially observed settings.
Substantiating this hypothesis requires empirical comparisons to conventional digital baselines under matched compute and energy budgets.

9.3 Limitations and future directions

Several limitations follow from the current design. First, the architecture reuses a sequential floating-point datapath that iterates over presynaptic indices, reducing area per neuron but increasing tick latency as fan-in grows. Second, synthesizable implementations of nonlinear activations and their derivatives require careful numerical design. Third, convergence and stability guarantees for the coupled discrete-time, finite-precision, clamped system are not implied by continuous-time analyses; stability regions should be mapped empirically and, where possible, analyzed theoretically. These limitations motivate future work on architectural scaling (parallelism vs. area/power), activation approximations suitable for synthesis, and task-driven benchmarks that identify regimes where local online inference is beneficial.

Appendix A Reproducing the Experiments

All experiments described in this paper can be reproduced using the public repository:

git clone https://github.com/alskaf1293/neuralcomputer
cd neuralcomputer
./scripts/test_all.sh

Running the script executes the full suite of simulation experiments. Outputs are written as CSV files under the runs/ directory. The repository also provides Python scripts that generate the plots reported in the paper from these outputs.

Acknowledgments

I am especially grateful to Greg Ver Steeg for his encouragement and for many conversations that helped sustain this project from beginning to end.

References

[1] M. Davies, N. Srinivasa, T. Lin, G. N. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang (2018) Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1), pp. 82–99.
[2] K. Friston (2005) A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences 360(1456), pp. 815–836.
[3] K. Friston (2009) The free-energy principle: a rough guide to the brain? Trends in Cognitive Sciences 13(7), pp. 293–301.
[4] S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana (2014) The SpiNNaker project. Proceedings of the IEEE 102(5), pp. 652–665.
[5] G. Hinton (2022) The forward-forward algorithm: some preliminary investigations. arXiv preprint arXiv:2212.13345.
[6] B. Karimi, H. Wai, E. Moulines, and M. Lavielle (2019) On the global convergence of (fast) incremental expectation maximization methods. arXiv preprint arXiv:1910.12521.
[7] T. P. Lillicrap, A. Santoro, L. Marris, C. J. Akerman, and G. E. Hinton (2020) Backpropagation and the brain. Nature Reviews Neuroscience 21(6), pp. 335–346.
[8] H. P. Moravec (1988) Mind Children: The Future of Robot and Human Intelligence. Harvard University Press, Cambridge, MA.
[9] R. M. Neal and G. E. Hinton (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, pp. 355–368.
[10] R. P. Rao and D. H. Ballard (1999) Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience 2(1), pp. 79–87.
[11] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. Nature 323(6088), pp. 533–536.
[12] T. Salvatori, Y. Song, Y. Yordanov, B. Millidge, Z. Xu, L. Sha, C. Emde, R. Bogacz, and T. Lukasiewicz (2023) A stable, fast, and fully automatic learning algorithm for predictive coding networks. arXiv preprint arXiv:2212.00720.
[13] S. Schmitt, J. Klähn, G. Bellec, A. Grübl, M. Guettler, A. Hartel, S. Hartmann, D. Husmann de Oliveira, K. Husmann, V. Karasenko, M. Kleider, C. Koke, C. Mauch, E. Müller, P. Müller, J. Partzsch, M. A. Petrovici, S. Schiefer, S. Scholze, B. Vogginger, R. Legenstein, W. Maass, C. Mayr, J. Schemmel, and K. Meier (2017) Neuromorphic hardware in the loop: training a deep spiking network on the BrainScaleS wafer-scale system. pp. 2227–2234.
[14] J. C. R. Whittington and R. Bogacz (2017) An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation 29(6), pp. 1229–1262.