Paper deep dive
A Variational Latent Equilibrium for Learning in Neuronal Circuits
Simon Brandt, Paul Haider, Walter Senn, Federico Benitez, Mihai A. Petrovici
Abstract
Brains remain unrivaled in their ability to recognize and generate complex spatiotemporal patterns. While AI is able to reproduce some of these capabilities, deep learning algorithms remain largely at odds with our current understanding of brain circuitry and dynamics. This is prominently the case for backpropagation through time (BPTT), the go-to algorithm for learning complex temporal dependencies. In this work we propose a general formalism to approximate BPTT in a controlled, biologically plausible manner. Our approach builds on, unifies and extends several previous approaches to local, time-continuous, phase-free spatiotemporal credit assignment based on principles of energy conservation and extremal action. Our starting point is a prospective energy function of neuronal states, from which we calculate real-time error dynamics for time-continuous neuronal networks. In the general case, this provides a simple and straightforward derivation of the adjoint method result for neuronal networks, the time-continuous equivalent to BPTT. With a few modifications, we can turn this into a fully local (in space and time) set of equations for neuron and synapse dynamics. Our theory provides a rigorous framework for spatiotemporal deep learning in the brain, while simultaneously suggesting a blueprint for physical circuits capable of carrying out these computations. These results reframe and extend the recently proposed Generalized Latent Equilibrium (GLE) model.
Tags
Links
- Source: https://arxiv.org/abs/2603.09600v2
- Canonical: https://arxiv.org/abs/2603.09600v2
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/13/2026, 1:03:33 AM
Summary
The paper introduces Variational Latent Equilibrium (VLE), a biologically plausible framework for spatiotemporal credit assignment in neuronal networks. VLE approximates backpropagation through time (BPTT) by deriving local learning rules from a prospective energy function, extending the Generalized Latent Equilibrium (GLE) model. It addresses the weight transport problem and causality issues by introducing learnable backward weights, enabling local, time-continuous learning in physical neuronal circuits.
Entities (5)
Relation Signals (3)
Variational Latent Equilibrium → approximates → Backpropagation Through Time
confidence 95% · In this work we propose a general formalism to approximate BPTT in a controlled, biologically plausible manner.
Variational Latent Equilibrium → extends → Generalized Latent Equilibrium
confidence 95% · These results reframe and extend the recently proposed Generalized Latent Equilibrium (GLE) model.
Variational Latent Equilibrium → solves → Weight Transport Problem
confidence 90% · VLE uses such a mechanism to improve on the approximations made by GLE while still respecting temporal (as well as spatial) locality.
Cypher Suggestions (2)
Find all models extended by VLE · confidence 90% · unvalidated
MATCH (v:Framework {name: 'Variational Latent Equilibrium'})-[:EXTENDS]->(m) RETURN m.name
Identify algorithms approximated by VLE · confidence 90% · unvalidated
MATCH (v:Framework {name: 'Variational Latent Equilibrium'})-[:APPROXIMATES]->(a:Algorithm) RETURN a.name
Full Text
54,385 characters extracted from source content.
A Variational Latent Equilibrium for Learning in Neuronal Circuits
Simon Brandt 1, Paul Haider 1, Walter Senn 1, Federico Benitez 1,+,*, and Mihai A. Petrovici 1,+
1 Department of Physiology, University of Bern, Bühlplatz 5, 3012 Bern, Switzerland; + Joint senior authorship; * federico.benitez@unibe.ch
Abstract
Brains remain unrivaled in their ability to recognize and generate complex spatiotemporal patterns. While AI is able to reproduce some of these capabilities, deep learning algorithms remain largely at odds with our current understanding of brain circuitry and dynamics. This is prominently the case for backpropagation through time (BPTT), the go-to algorithm for learning complex temporal dependencies. In this work we propose a general formalism to approximate BPTT in a controlled, biologically plausible manner. Our approach builds on, unifies and extends several previous approaches to local, time-continuous, phase-free spatiotemporal credit assignment based on principles of energy conservation and extremal action. Our starting point is a prospective energy function of neuronal states, from which we calculate real-time error dynamics for time-continuous neuronal networks. In the general case, this provides a simple and straightforward derivation of the adjoint method result for neuronal networks, the time-continuous equivalent to BPTT. With a few modifications, we can turn this into a fully local (in space and time) set of equations for neuron and synapse dynamics. Our theory provides a rigorous framework for spatiotemporal deep learning in the brain, while simultaneously suggesting a blueprint for physical circuits capable of carrying out these computations. These results reframe and extend the recently proposed Generalized Latent Equilibrium (GLE) model.
1 Introduction
The brain remains unsurpassed in its ability to recognize and generate complex spatiotemporal patterns.
When learning such a task, synaptic connections between neurons, and thus the brain's behavior, are changed to increase performance. The performance can be described by a cost function that measures how well the behavior aligns with a target behavior: to achieve optimal performance, system parameters need to change such that the cost is minimized. This optimization has to take place under the physical and biological constraints of the neuronal system. Minimizing the cost, while forcing the dynamics of the system to obey certain pre-defined physical and biological constraints, can be described as a constrained optimization problem. A general method to solve such a problem is the application of calculus of variations, taking inspiration from physics and from the theory of optimal control. It is important to emphasize that in biology, such an optimal solution needs to be learned by a network of physical neurons. In turn, for the learning procedure to be biologically plausible, it is crucial to not violate the constraints stemming from biology and physics. Recently, [1] proposed a biologically plausible algorithm, local in space and time, to implement spatiotemporal learning in physical networks of neurons. This framework, Generalized Latent Equilibrium (GLE), extends and generalizes Latent Equilibrium (LE) [2] to enable networks to perform temporal processing tasks. To do so, these frameworks leverage the temporal processing capabilities of biological neurons in two ways. As a first ingredient, neuronal membrane integration is well known to perform low-pass filtering of the input, and thus to introduce memory effects associated with signal delays. Additionally, it is well established, though less well known, that neurons can generally react prospectively to their input, so that the output is based on the expected future input. For a more detailed discussion of temporal processing at the level of individual neurons see the recent [3].
These two temporal operations are implemented locally at the level of individual neurons, leading to biologically plausible rules of neuron dynamics and learning through synaptic updates. In fact, GLE can be shown to approximate the adjoint method (AM), which is the standard method to solve constrained optimization problems in continuous time. While LE deals with purely spatial problems, and is related to methods such as equilibrium propagation [4] and the neuronal least action principle [5], GLE solves spatiotemporal problems and is related to backpropagation through time (BPTT) [6, 7] and real-time recurrent learning (RTRL) [8]. In this work, we improve on these previous results in several ways. First, we go back to a first-principles approach in terms of variational calculus, which allows for a unified description of many of these related methods. Secondly, we give an explicit derivation of neuronal dynamics for systems of neurons equipped with these temporal processing capabilities. This derivation recovers the results of the AM, while being much simpler and conceptually more transparent than standard derivations of the AM. In order to recover biological plausibility, we show which explicit approximations within our formalism give rise to GLE dynamics. Additionally, we propose a systematic method to correct for the distortion involved in the GLE approximation in a biologically plausible manner. We provide some promising first applications in learning to reproduce complex temporal behavior.
2 Results
We consider a general setup for a neuronal network, capable of representing systems of physical neurons such as biological and neuromorphic neurons. We denote the input signal by x(t), and the output of the network by y(t). In a feedforward network this output will be generated by the last layer of the network.
In the specific case of supervised learning, the target output is denoted by the vector y*(t), and the instantaneous cost function by C(y(t), y*(t)). Networks depend on parameters denoted by the vector θ, which include, for example, synaptic weights W and time constants τ. In the most general case of a temporal processing task, we want to minimize the integrated cost \(\mathcal{C} = \int_{t_1}^{t_2} C(t)\,\mathrm{d}t\) with respect to the parameters θ, by means of some implementation of gradient descent:

\(\dot{\theta} \sim -\nabla_\theta \mathcal{C}\) .  (1)

As the dynamics of the network are constrained by its physical properties, this constitutes an example of a constrained optimization problem [9–11]. A general solution to this problem is provided by the powerful AM [12], but its derivation is usually very complicated and does not take into account the specific characteristics of bio-inspired neuronal dynamics. Here we reach the same outcome as the AM while using a simple derivation that is better suited to the needs of computational neuroscience. To achieve this, we harness the dual power of (i) an energy-based approach inspired by the many results in physics that use the minimization of a function to encode physical dynamics, and (ii) a specific functional form for the constraints, inspired by the phenomenon of prospectivity in biological neuronal networks [3]. Our derivation encompasses and extends previous results such as LE [2] and GLE [1], while being easily adaptable to also derive the neuronal least action (NLA) principle [5], all of which can be shown to be related to our overarching formalism. In the following, we use scalars rather than vectors to highlight locality, a key factor in bio-plausibility.
Following assumption (i), we postulate a network energy E that is a sum over squared neuron-local errors e_i and the global cost C multiplied by a small nudging parameter β:

\(E(t) = \tfrac{1}{2}\sum_i e_i^2(t) + \beta C(t)\) .  (2)

The specific form of these errors ultimately determines the structure and dynamics of the system. As we show in Methods, for β → 0 the minimization of this energy with respect to the parameters θ is equivalent to the minimization of the cost C. For networks of leaky integrator neurons, we define the local mismatch error as

\(e_i = \breve{u}_i^{\,m} - \sum_j W_{ij}\,\varphi_j(\breve{u}_j^{\,r})\) .  (3)

W_ij are the synaptic input weights, and the output rate of a neuron is defined as \(r_i = \varphi_i(\breve{u}_i^{\,r})\), with activation φ_i as a function of the neuron's membrane voltage u_i. Throughout this manuscript, we use a series of shorthands for temporal operators, as described in Table 1.

|  | exponential filtering | temporal differentiation |
| --- | --- | --- |
| forward-facing | discounted future: \(\tilde{u}^m = \frac{1}{\tau_m}\int_t^{\infty} u(t')\, e^{(t-t')/\tau_m}\, \mathrm{d}t'\) | look-ahead: \(\breve{u}^r = \bigl(1 + \tau_r \tfrac{\mathrm{d}}{\mathrm{d}t}\bigr)\, u(t)\) |
| backward-facing | low-pass: \(\bar{u}^m = \frac{1}{\tau_m}\int_{-\infty}^{t} u(t')\, e^{(t'-t)/\tau_m}\, \mathrm{d}t'\) | look-back: \(\hat{u}^r = \bigl(1 - \tau_r \tfrac{\mathrm{d}}{\mathrm{d}t}\bigr)\, u(t)\) |

Table 1: The temporal operators can be separated into those taking the future (forward-facing) or the past (backward-facing) into account.

These operators represent retrospective (past-facing) as well as prospective (future-facing) behaviors, both observed in biological neurons. Briefly stated, the low-pass filter operator stems from membrane integration, and the look-ahead operator represents the ability of cortical neurons to be sensitive not only to the current input, but to how this input is changing, and to react prospectively to that information. For a detailed discussion we point the reader to the recent [3]. It is easy to check that these operators are pairwise inverses of each other:

\(\hat{\tilde{u}^{r}}{}^{r} = \breve{\bar{u}^{r}}{}^{r} = u\) .  (4)

For learning, we derive local plasticity rules by gradient descent.
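Before turning to the learning rules, the inverse relation in Eq. (4) can be checked numerically. Below is a minimal discretized sketch of the low-pass and look-ahead operators; the Euler discretization, function names and parameter values are our own choices, not from the paper:

```python
import numpy as np

def low_pass(u, tau, dt):
    """Backward-facing exponential filter: tau * d(u_bar)/dt = u - u_bar (Euler)."""
    u_bar = np.zeros_like(u)
    for k in range(1, len(u)):
        u_bar[k] = u_bar[k - 1] + dt / tau * (u[k - 1] - u_bar[k - 1])
    return u_bar

def look_ahead(u, tau, dt):
    """Prospective operator (1 + tau * d/dt) u, with finite differences."""
    return u + tau * np.gradient(u, dt)

dt, tau = 1e-3, 20e-3
t = np.arange(0.0, 1.0, dt)
u = np.sin(2 * np.pi * 3.0 * t)

# Eq. (4): the look-ahead undoes the low-pass with the same time constant,
# up to discretization error and the initial transient.
recovered = look_ahead(low_pass(u, tau, dt), tau, dt)
err = np.max(np.abs(recovered[200:] - u[200:]))
print(f"max reconstruction error after transient: {err:.4f}")
```

The residual error here is purely a discretization artifact (one-step delay of the Euler filter plus finite-difference error); in continuous time the composition is exact.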
Locality follows by virtue of the locality of the individual neuronal energies \(E_i = \tfrac{1}{2} e_i^2\) defined above. The plasticity rules read

\(\dot{\theta}_i \sim -\frac{\partial E}{\partial \theta_i} = -e_i \frac{\partial e_i}{\partial \theta_i}\) ,  (5)

as well as a possible additional nudging term proportional to β, usually only present in output neurons during training. The index of θ_i indicates its association to neuron i and thereby its unique appearance in each E_i. In particular, for synaptic weights, we get the dynamics

\(\dot{W}_{ij} = e_i\, r_j\) .  (6)

We now derive error propagation from the energy as well; to account for the integrated cost \(\mathcal{C}\), we need to consider the integrated energy^1 \(\mathcal{E} = \int_{t_1}^{t_2} E(t)\,\mathrm{d}t\). Following standard methodology, we require neuronal dynamics to keep the integrated energy stationary:

\(\delta_u \mathcal{E} = 0\) .  (7)

This variational approach can be viewed as a form of equilibrium in the state of lowest integrated energy rather than just the lowest instantaneous energy. Furthermore, as we show in more detail below, the outcome of the method links directly to (G)LE. Because of these reasons, we call this approach variational latent equilibrium (VLE). Finding the extremum of \(\mathcal{E}\) is equivalent to having neuronal dynamics obey the corresponding Euler-Lagrange equations, which yield the error dynamics (see Methods for details)

\(e_i = \tilde{\varepsilon}_i^{\,m}\) with \(\varepsilon_i = \varphi_i' \sum_k W_{ki}\, \hat{e}_k^{\,r}\) ,  (8)

which is exactly the result of the AM for this type of system [1]. Note that this is derived in a much simplified way compared to standard treatments, thanks to the specific form taken by dynamical variables when modeling prospective behavior. It is remarkable that one can reproduce these results from first principles at a small fraction of the usual analytical cost, given that the range of application of our dynamics is still ample within computational neuroscience. As in standard AM, this expression for error propagation violates causality because the discounted future error depends on future states.
Solutions to this issue usually require biologically implausible treatments. On top of this, strictly speaking the expression also violates spatial locality, because backward errors e_j need to propagate through forward synapses W_ji, an expression of the classical weight transport problem. Below we show how to deal with both these issues. To explicitly see how errors couple into neuronal dynamics, we can simply rearrange Eqn. (3) to obtain

\(\tau_m \dot{u}_i = -u_i + \sum_j W_{ij}\, \varphi_j(\breve{u}_j^{\,r}) + e_i\) ,  (9)

with "forward" propagation of signals \(\varphi_j(\breve{u}_j^{\,r})\) and "backward" propagation of the errors e_j in Eqn. (8). A natural interpretation of Eqn. (9) corresponds to a three-compartment neuron, with a somatic compartment storing u_i, an input compartment storing \(\sum_j W_{ij}\varphi_j(\breve{u}_j^{\,r})\) and an error compartment storing e_i; in cortical neurons, a likely (but not strictly necessary) assignment of such compartments would be for the input compartment to be identified as basal and the error compartment as apical. Note that in the AM, neuronal activity is not affected by errors; in our networks, this can be reproduced in the limit of β → 0 (see Methods). At this point we can easily consider the special case τ_m = τ_r; then, as discussed above Eqn. (4), the discounted future and look-back operators cancel each other out and VLE reduces to LE [2]: \(e_i = \varphi_i'(\breve{u}_i^{\,m}) \sum_j W_{ji}\, e_j\).

Footnote 1: As energy-based approaches such as ours are often inspired by physics, they have sometimes adopted further terminology. For example, the energy has also been called a Lagrangian, and its integral an action in, e.g., [5]. The term "Lagrangian" also appears in the context of optimization, but rather as a link to the Lagrange multiplier method (e.g., in [13, 14]). Here, we eschew this terminology in order to avoid stronger implications from theoretical physics that do not apply here.
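To make the locality of Eqs. (6) and (9) concrete, here is a toy Euler simulation of a single leaky integrator tracking a teacher. The error compartment is reduced to a plain nudging term e = β(u_teacher − u), and the prospective operators and the error propagation of Eq. (8) are left out, so this is an illustrative sketch of the local dynamics, not the full VLE scheme; all names and constants are ours:

```python
import numpy as np

# Toy Euler simulation of the membrane dynamics of Eq. (9) combined with
# the local plasticity rule of Eq. (6), W_dot ~ e * r, for a single leaky
# integrator learning a teacher weight.  The error compartment is reduced
# to output nudging, e = beta * (u_teacher - u); prospective operators and
# the error propagation of Eq. (8) are deliberately omitted.

dt, tau_m = 1e-3, 10e-3
beta, eta = 1.0, 0.2
phi = np.tanh

W_teacher, W = 1.5, 0.0
u, u_t = 0.0, 0.0  # student and teacher membrane potentials

for step in range(300_000):
    r_in = phi(np.sin(2 * np.pi * 2.0 * step * dt))   # presynaptic rate
    u_t += dt / tau_m * (-u_t + W_teacher * r_in)     # teacher dynamics
    e = beta * (u_t - u)                              # error compartment
    u += dt / tau_m * (-u + W * r_in + e)             # Eq. (9)
    W += dt * eta * e * r_in                          # Eq. (6): local in space and time

print(f"learned weight: {W:.3f} (teacher: {W_teacher})")
```

Every quantity entering the weight update is available at the synapse at the current time step, which is the locality property the text emphasizes.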
The LE special case is equivalent to requiring that neuronal dynamics conserve the instantaneous energy, ∂_u E(t) = 0 (which is indeed the starting point of LE), which yields the instantaneous errors necessary to reduce the instantaneous cost.^2 Importantly, in LE there is no issue with causality, as the dynamics only depend on the instantaneous state of the neurons and error signals. Setting aside this special case, to achieve temporal locality in the error stream we approximate the discounted future operator by the look-ahead; additionally, we can approximate the look-back by the low-pass filter to make error dynamics even more neuron-like, which has the additional advantage of partially compensating for the approximation made to the forward signal. With these approximations, we get

\(e_i \simeq \breve{\varepsilon}_i^{\,m}\) with \(\varepsilon_i \simeq \varphi_i' \sum_k W_{ki}\, \bar{e}_k^{\,r}\) ,  (10)

and recover the GLE equations. The best way to understand this approximation is through Fourier analysis, which yields the Fourier transforms and the corresponding gains for each operator, as shown in Table 2. The approximations towards GLE conserve the phase shift of all Fourier components in the filtered signal while maintaining causality, at the cost of distorting their amplitudes by a gain factor \(\sqrt{1 + (\omega\tau)^2}\) for each of the two operator switches [1]. To see how VLE can go beyond this approximation, we first discuss the issue of spatial locality, the (in)famous weight transport problem.

| operator | \(\hat{x}^m\) (look-back) | \(\breve{x}^m\) (look-ahead) | \(\bar{x}^m\) (low-pass) | \(\tilde{x}^m\) (discounted future) |
| --- | --- | --- | --- | --- |
| \(H(\omega)\) | \(1 - i\omega\tau_m\) | \(1 + i\omega\tau_m\) | \(\frac{1}{1 + i\omega\tau_m}\) | \(\frac{1}{1 - i\omega\tau_m}\) |
| \(G(\omega)\) | \(\sqrt{1 + (\omega\tau_m)^2}\) | \(\sqrt{1 + (\omega\tau_m)^2}\) | \(\frac{1}{\sqrt{1 + (\omega\tau_m)^2}}\) | \(\frac{1}{\sqrt{1 + (\omega\tau_m)^2}}\) |

Table 2: The temporal operators translate directly into Fourier space, for each single frequency component ω. The Fourier transform H(ω) yields the gain G(ω) for each operator.

Figure 1: Learning a simple chain in a student-teacher setup. (a) Signal (bottom-up) and error (top-down) neurons, their incoming and outgoing firing rates and their dynamics. (b) A chain of two neurons as the simplest network structure for studying error backpropagation. (c) The network converges to a minimal loss if the backward weights B are learned (blue) or fixed to B = W^T (orange). For constant B (purple), the network does not converge. We plot a moving average of the loss to remove high-frequency oscillations. (d-f) The student (opaque) and teacher (transparent) weights. Student weights W converge to the teacher weights if B is learned (d) or B = W^T (e), and diverge for constant B (f). The learned backward weights B (d, pink) converge to the expected value given the time constant of the neuron and the frequency of the signal. (g) compares the output of the three scenarios before, during and after learning to the target signal.

Footnote 2: These instantaneous equations are also related to those derived in the NLA framework [5], but with an important difference: in NLA, instantaneity is achieved by having each neuron use its low-pass filter to undo its presynaptic partners' prospective filters. Thus, for each neuron, all its postsynaptic partners need to share their τ_m and all its presynaptic partners their τ_r (i.e., given a network graph, all neurons within a partition, for example within a layer, need to share time constants). Furthermore, since neurons receiving an external input necessarily carry out a low-pass filter that is never compensated for, the network cannot react in the instantaneous manner required by exact error backpropagation.
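The gain distortion of \(\sqrt{1+(\omega\tau)^2}\) per operator switch can be verified numerically. The sketch below measures the empirical gain of the (causal) look-ahead and of the low-pass on a pure sine and compares them with G(ω) from Table 2; the discretization and parameter choices are ours:

```python
import numpy as np

# Empirical gains of the causal look-ahead (1 + tau d/dt) and of the
# low-pass filter on a pure sine, compared against G(omega) from Table 2.

tau, f, dt = 20e-3, 5.0, 1e-4
w = 2 * np.pi * f
t = np.arange(0.0, 2.0, dt)
x = np.sin(w * t)

# look-ahead: same gain as the discounted future, but causal
la = x + tau * np.gradient(x, dt)
G_la = la[5000:15000].max()

# low-pass (Euler integration); measure gain after the transient
lp = np.zeros_like(x)
for k in range(1, len(x)):
    lp[k] = lp[k - 1] + dt / tau * (x[k - 1] - lp[k - 1])
G_lp = lp[10000:].max()

print(f"look-ahead gain {G_la:.3f} vs sqrt(1+(wt)^2) = {np.sqrt(1 + (w*tau)**2):.3f}")
print(f"low-pass gain {G_lp:.3f} vs 1/sqrt(1+(wt)^2) = {1/np.sqrt(1 + (w*tau)**2):.3f}")
```

Both filtered signals stay in phase with the analytical prediction; only their amplitudes are scaled, which is exactly the distortion the backward weights are later trained to absorb.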
To address spatial locality, we introduce backward weights B_ik:

\(e_i = \breve{\varepsilon}_i^{\,m}\) with \(\varepsilon_i = \varphi_i' \sum_k B_{ik}\, \bar{e}_k^{\,r}\) .  (11)

The neurons required to implement the signal and error dynamics are symmetric in their temporal processing, low-pass filtering their input and advancing their output prospectively, while having different weights in the forward and backward path, as shown in Fig. 1a and b. As a first approximation, backward weights can be random and constant, so that the learning of the forward weights can compensate for the deviations in the error signal, which is known in the literature as Feedback Alignment (FA) [15]. Unfortunately, this solution is of limited help for multi-layered networks, where weight learning cannot compensate for the combined effects of several layers of forward and backward weights. More robustly, the backward weights could learn to mirror their reciprocal forward synapses W_ji through specific learning rules, e.g., phaseless alignment learning (PAL) [16] (but see also [17] for rate-based models such as we discuss in this paper, and also [18] and the recent [19] for the spiking case). VLE uses such a mechanism to improve on the approximations made by GLE while still respecting temporal (as well as spatial) locality. By learning backward weights that compensate for the gain offset in GLE, the network can better converge to the exact AM solution, and the dynamics can remain local. To see how this works, we start by defining instantaneous errors for AM and VLE, and write down how they propagate:

AM: \(e_i = \tilde{\varepsilon}_i^{\,m}\) with \(\varepsilon_i = \varphi_i' \sum_k W_{ki}\, \hat{e}_k^{\,r}\) ,  (12)
VLE: \(e_i = \breve{\varepsilon}_i^{\,m}\) with \(\varepsilon_i = \varphi_i' \sum_k B_{ik}\, \bar{e}_k^{\,r}\) .  (13)

To equalize the two, we can learn a B that compensates for the difference between the look-ahead and the discounted future operators, so that \(e_i^{\mathrm{AM}} \overset{!}{=} e_i^{\mathrm{VLE}}\). In fact we can consider the even stronger constraint

\(W_{ji}\, \tilde{\varepsilon}_j^{\,m} \overset{!}{=} B_{ij}\, \breve{\varepsilon}_j^{\,m}\) .  (14)

To achieve the above, we simply use local gradient descent to minimize

\(E_{B_{ij}} = \bigl( W_{ji}\, \hat{\breve{\varepsilon}_j^{\,r}}{}^{r} - B_{ij}\, \hat{\breve{\varepsilon}_j^{\,m}}{}^{m} \bigr)^2\)  (15)

\(\Rightarrow\ \dot{B}_{ij} \sim -\frac{\partial E_{B_{ij}}}{\partial B_{ij}} = \bigl( W_{ji}\, \hat{\breve{\varepsilon}_j^{\,r}}{}^{r} - B_{ij}\, \hat{\breve{\varepsilon}_j^{\,m}}{}^{m} \bigr)\, \hat{\breve{\varepsilon}_j^{\,m}}{}^{m}\) ,  (16)

where both look-aheads and look-backs are causal and thus local in time (see Methods for a detailed derivation). We used \(\varepsilon_j = \hat{\breve{\varepsilon}_j^{\,r}}{}^{r}\) to compensate for the difference between the low-pass filter in VLE and the look-back operator in AM. Since requiring B_ij = W_ji for the error neurons is biologically not plausible, we point out that this expression can be combined with PAL to obtain a learning rule that is fully local in space and time (see Methods for more details).
Lagline
To demonstrate how learning of the backward weights B influences the training of networks, we use a simple chain of two neurons in a student-teacher setup as depicted in Fig. 1b. In the task, a student network has to adapt its weights to approximate the output of a teacher network with the same network topology. Because of the simplicity of the network and the choice of nonlinearities, the task has a unique solution: learning the teacher weights. The weights of the student are initialized as W_ij = −W*_ij, with W*_01 = 1 and W*_12 = 2 being the teacher weights. The results are shown in Fig. 1, where we validate three scenarios: learning B with Eqn. (16) (blue), using B = W^T (orange) and B = const. (purple). While the former two converge towards a loss minimum, the latter corresponds to Feedback Alignment with wrongly initialized weights and does not converge (Fig. 1c). Consistently, the weights converge towards the teacher weights when learning B (Fig. 1d) and when using B = W^T (Fig. 1e), but not for B = const. (Fig. 1f). Because the input to the network is a simple sine, we can estimate the target for the learned backward weights, to which their value B indeed converges (Fig. 1d).
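Before continuing with the results: structurally, Eqn. (16) is an online delta-rule regression of B onto the forward pathway. The scalar caricature below replaces the two nested filter chains by stand-in signals that differ only by the GLE gain factor, and shows b converging to w divided by that gain; signals, constants and the simplification are ours, not the paper's:

```python
import numpy as np

# Scalar caricature of the backward-weight rule, Eqn. (16).  The two
# differently filtered error signals are replaced by stand-ins a(t) and
# c(t) that differ only by the GLE gain factor sqrt(1 + (w*tau)^2); b then
# performs an online least-squares regression onto the forward weight.

dt, eta = 1e-3, 0.5
w_fwd = 1.2                      # forward weight W_ji (stand-in value)
tau, f = 20e-3, 5.0
omega = 2 * np.pi * f
gain = np.sqrt(1 + (omega * tau) ** 2)

t = np.arange(0.0, 200.0, dt)
a = np.sin(omega * t)            # stand-in for the AM-filtered error
c = gain * np.sin(omega * t)     # stand-in for the VLE-filtered error

b = 0.0
for ai, ci in zip(a, c):
    b += dt * eta * (w_fwd * ai - b * ci) * ci   # local gradient step, as in Eqn. (16)

print(f"learned b = {b:.3f}, expected w/gain = {w_fwd / gain:.3f}")
```

The fixed point b = w·⟨ac⟩/⟨c²⟩ is exactly the gain-compensated backward weight, which is why the learned B in Fig. 1d settles at a value determined by the neuron's time constant and the signal frequency.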
Since we initialized the weights differently from the teacher network, the output of each scenario is off the target before learning (Fig. 1g). During learning, the outputs approach the target while the weights are optimized (Fig. 1g). After learning, the scenarios of learning B and using B = W^T converge to the target, while B = const. does not converge to a good solution. While we claim that learning B is beneficial for the learning of the forward weights, in this simple task B = W^T converges to the teacher weights faster. Our algorithm requires more time to converge, as every single synapse has to independently learn W and B. To demonstrate the strength of our method to effectively induce a synapse-specific learning rate through the learning of B, we will explore more complex tasks that involve larger networks of neurons.

Figure 2: Learning in a network of neurons. (a) Forward network structure (error neurons omitted for readability). (b) Network loss over time, evaluated on multiple test signals with randomly sampled frequencies and over multiple seeds. Orange: backward weights follow exact transposes of the forward weights. Blue: backward weights learn using our local update rule (Eqn. (16)). (c) Training input, consisting of multiple overlaid sines of different frequency, presented to the network at different times. (d) Output of the student network learning B (solid) compared to the teacher network (dashed) before training, during training and after training.
Lagnet
To investigate the performance of our model when solving more complicated problems, we train a network of neurons with multiple outputs, as shown in Fig. 2a, again in a teacher-student setup.
Here, both the student and the teacher receive an input signal that is a combination of sines with different frequencies and amplitudes, as shown for different times during training in Fig. 2c. The network has two hidden layers, with two and three neurons, and three output neurons. For such a network there is no longer a unique solution, and the learned weights can differ from the teacher weights. Before learning, the output of the student differs from the output of the teacher (Fig. 2d). During learning, the output of the student approaches the output of the teacher, to finally converge after learning. To test the performance of the model, we compare the scenario with learned B to using B = W^T. Both scenarios are trained using only the signal shown in Fig. 2c and are tested on signals that consist of combinations of sines with randomly sampled frequencies within the frequency range of the training signal. Initially, as compared to learning B, the test loss using B = W^T decreases faster, because of the higher learning rate at each synapse, but eventually stagnates at a relatively high test loss. If we learn B, the test loss decreases more slowly initially, but eventually reaches a lower minimum. Thus, in this more complex task we can demonstrate that the synapse-dependent learning rate obtained by learning B helps to converge to a better test performance.

Figure 3: Temporal XOR task. (a) The network consists of two laglines that perform the temporal computation and delay the input signal, and one XOR layer that does the spatial computation. The target is generated by a teacher network with weights shown along the edges. (b) The training loss when learning B converges faster compared to using B = W^T, plotted as a moving average to remove high-frequency oscillations.
(c) The two input signals, each with a different frequency, at different times during the simulation. (d) The output of the model with learned B (blue) compared to the target before, during and after training.
Temporal XOR
Next, we test the performance of our model on a more difficult temporal processing task: a continuous and temporal form of the XOR task, which in its binary variant usually requires learning |x_1 − x_2| for x_1, x_2 ∈ {0, 1}. To make the task continuous, we use the rates r_0, r_1 ∈ (0, 1) as input, and to make the task temporal, we require the network to learn the shifted difference of the two, such that the target becomes \(|r_0^{\mathrm{in}}(t - \Delta_0) - r_1^{\mathrm{in}}(t - \Delta_1)|\). Here, \(r_{0/1}^{\mathrm{in}}(t - \Delta_{0/1})\) are the inputs of the network shifted by Δ_{0/1}. Moreover, the task requires the network to store some memory of previous states, or at the very least to use different sub-networks to encode information corresponding to different times. The two input signals have different frequencies and are shown for different times during the simulation (Fig. 3c). Before training, the output of the student is very different from the target, and converges throughout the training as shown in Fig. 3d. Learning B leads to a significantly faster convergence of the training loss compared to using B = W^T. Since the sharp edges caused by the absolute value in the target lead to high-frequency components in the error, the network used to learn this task benefits from the synapse-specific learning of the backward weights, which accounts for offsets in the gain that arise when using B = W^T.
3 Discussion
Biological plausibility The main thrust behind our method is to provide a systematic approximation to spatiotemporal learning that is local both in space and time, thus respecting physical and biological constraints. Approaching the problem as the minimization of a local energy functional ensures spatial locality.
The outcome of variational optimization is easy to compute using elemen- tary calculus of variations, and requires information about future errors that is non-local-in-time. This is to be ex- pected, as true optimization of a temporal task requires perfect knowledge of future states. More specifically, in order to preserve causality, the network has to learn to predict the discounted future error. Because of its expo- nential weighting, the discounted future error is mostly sensitive to the near future, making it feasible to approx- imate it using local quantities. Here, we use the look- ahead operator as a first order approximation, which is also the basis of the GLE framework. In this work we go beyond this, by learning backward weights to compen- sate for the approximated future errors. This backward learning can be combined with methods developed to solve the problem of weight transport, most notably the PAL framework. All these ingredients can be assembled into a cortical microcircuit that implements biologically plausi- ble learning, and constitutes a blueprint for brain models and for neuromorphic applications. In this way, we have a full framework that learns forward and backward weights to achieve a local spatiotemporal approximation of the BPTT algorithm. Comparison to previous results This work builds upon and combines the insights stemming from sev- eral previous publications, specially (Generalized) Latent Equilibrium [1, 2] and Neuronal Least Action [5]. When considering GLE, our method is an extension in two senses. First, it presents GLE as the consequence of a variational optimization procedure followed by a linear approximation for the discounted future error. That is to say, VLE takes inspiration from physics and uses a global minimization principle as a starting point. 
This allows an elementary derivation of the equations governing spatiotemporal learning in systems of neurons, a much simpler analytical route to a result that is equivalent to the well-known AM. Physically speaking, the energy we are minimizing carries information about the local "frustration" or mismatch at the level of individual neurons, caused by the difference between the input a neuron receives from a previous layer and the input it is expecting to receive, as represented by its prospective membrane voltage. This mismatch or error information is then transferred from the output layers downward, via the error signals through error neurons. Thus, the outcome of the method is compatible with GLE also at the level of specific microcircuit implementations. Nonetheless, the variational derivation gives firmer grounding to the whole method, at least in terms of technical accessibility. Secondly, VLE improves on GLE by correcting for this approximation, training the backward weights during learning. This has the apparent disadvantage of doubling the number of trainable parameters, but can be important in situations where the linear approximation of the discounted future error breaks down. This is most notably the case when a signal contains multiple high-frequency components, for which the erroneous gain modulation of GLE causes distorted responses. Note also that in order to achieve fully bio-plausible learning, the backward weights should be learned in any case, e.g. by the PAL algorithm. Thus, for a fully biologically plausible implementation of BPTT, the backward learning rule of VLE becomes a necessity. Our model is similar to NLA in taking inspiration from physics, but diverges from it in important ways. The most crucial difference is that NLA cannot perform temporal credit assignment.
Indeed, other than imposing a low-pass filter on its inputs, an NLA network effectively reacts instantaneously to external stimuli, and can neither carry out nor learn temporal sequence processing. In contrast, in VLE, prospectivity is not only present at the level of the forward "inference" neurons; neurons in the backward error path can also be prospective, to reduce the lag caused by membrane integration. Overall, there are two time constants that can be leveraged to perform rich temporal processing: one associated with membrane integration, and a second governing prospective output rates. This central difference is what enables temporal processing, and is the key to approximating the AM and BPTT. Additionally, the implicit microcircuit implementation behind VLE is much more symmetric than that of NLA, where errors are transmitted through instantaneous interneurons. In our microcircuits, we make use of the same type of neuron for inference and errors, connected in a symmetric way (see Fig. 1a). Finally, it is important to emphasize that some of the assumptions in VLE, such as the strict connectivity scheme imposed by the derivation, can be substantially loosened to account for the great diversity found in biology. For this, we can take inspiration from the microcircuit implementations recently developed by [20], which among other things show how methods such as GLE can be implemented in multiple biologically plausible microcircuits.

Conclusions
In this work we present VLE, a systematic approximation of the AM and BPTT that is local both in space and time. It thus lends itself both as a model of cortical computation and as a framework for implementing BPTT in brain-inspired neuromorphic hardware. Our work improves on previous results by refining temporal error transmission in the backward pathway.
For spatiotemporal tasks that require the processing of signals with a diverse frequency range, VLE enables enhanced stability and more accurate learning. Our experiments show that VLE can solve non-trivial temporal problems which cannot be solved by purely spatial methods. For tasks that require the processing of complex signals, learning the backward weights, and thus providing a neuron-specific learning rate that is modulated through the gain correction, leads to faster and better learning.

4 Methods

4.1 Theoretical derivation

Detailed derivation of VLE equations
The network of neurons is described by a global energy that consists of the sum of local error signals $e_i$ and a global cost $C$,

$$E(t) = \frac{1}{2} \sum_i e_i^2(t) + \beta C(t). \quad (2)$$

The membrane potential of a neuron follows the leaky-integrator dynamics

$$\tau_i^m \dot u_i = -u_i + e_i + \sum_j W_{ij}\, \phi_j(\breve u_j^r) + b_i, \quad (17)$$

where each neuron integrates bottom-up input $\phi_j(\breve u_j^r)$ via synaptic weights $W_{ij}$, with bias $b_i$ and top-down error $e_i$. The output rate of lower-layer neurons is given by the activation $\phi_j$ as a function of the prospective voltage $\breve u_j^r = u_j + \tau_j^r \dot u_j$, a look-ahead with time constant $\tau_j^r$. Each neuron $i$ thus low-pass filters its inputs with its membrane time constant $\tau_i^m$. This low-pass filter induces a lag of the order of $\tau_i^m$ that is partly reversed by the prospective output $\phi_i(\breve u_i^r)$, which implements a temporal advance of the order of $\tau_i^r$. Rearranging Eqn. (17) and using $\breve u_i^m = u_i + \tau_i^m \dot u_i$ yields the definition of the local mismatch errors

$$e_i = \breve u_i^m - \sum_j W_{ij}\, \phi_j(\breve u_j^r) - b_i. \quad (18)$$

The local mismatch error can be seen as the difference between what the neuron thinks it should be doing in the near future ($\breve u_i^m$) and what its predecessors think it should be doing. To minimize this local mismatch error alongside the global cost, we use a variational approach and minimize the integrated energy $\mathcal{E} = \int_{t_1}^{t_2} dt\, E(t)$.
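In discrete time, one step of the dynamics in Eqn. (17) together with the prospective output rate can be sketched as follows; the layer sizes, weights, and inputs are illustrative assumptions, and `r_low` stands in for the prospective rates $\phi_j(\breve u_j^r)$ of the lower layer:

```python
import numpy as np

# One forward-Euler step of the leaky-integrator dynamics of Eqn. (17),
# followed by the prospective output rate phi(u + tau_r * du/dt).
# All sizes, weights and inputs below are illustrative assumptions.
rng = np.random.default_rng(0)
n_low, n = 3, 2
W = rng.normal(scale=0.5, size=(n, n_low))   # forward weights W_ij
b = np.zeros(n)                              # biases b_i
tau_m, tau_r = 0.2, 0.1                      # membrane / prospective tau
dt = 0.01
phi = np.tanh

u = np.zeros(n)                  # membrane potentials u_i
e = np.zeros(n)                  # top-down errors e_i (zero here)
r_low = rng.uniform(size=n_low)  # prospective rates of the lower layer

du = (-u + W @ r_low + b + e) / tau_m        # Eqn. (17) rearranged
u = u + dt * du                              # Euler update
r = phi(u + tau_r * du)                      # prospective output rate
```

Note that the prospective rate uses the instantaneous derivative `du`, so no additional state is needed to implement the look-ahead.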
Here, we use the integral over time to account for the general case of temporal processing tasks, which require minimizing the integrated cost $\mathcal{C} = \int_{t_1}^{t_2} dt\, C(t)$ (see the next section for details). To find the minima of such a functional $\mathcal{E}$, a variational approach is well-suited. We consider small variations $\alpha\eta(t)$ of the membrane voltage, such that $u_\alpha(t) = u(t) + \alpha\eta(t)$, with $\eta(t)$ an arbitrary function that vanishes at the boundaries. Importantly, this variation of $u$ also affects its temporal derivative, so that $\dot u_\alpha(t) = \dot u(t) + \alpha\dot\eta(t)$. To consider variations of $u$ in $E$, we note that with Eqn. (18), and assuming the cost $C = C(u(t), \dot u(t))$ to be a function of the network output, we can write the energy as a function of the voltage $u$ and its temporal derivative $\dot u$, namely $E(t) = E(u(t), \dot u(t))$. According to the calculus of variations, the integrated energy $\mathcal{E}_\alpha = \int_{t_1}^{t_2} E(u_\alpha(t), \dot u_\alpha(t))\, dt$ has a minimum at $\alpha = 0$, and we can write

$$0 = \left.\frac{d\mathcal{E}_\alpha}{d\alpha}\right|_{\alpha=0} = \int_{t_1}^{t_2} \left.\frac{dE(u_\alpha(t), \dot u_\alpha(t))}{d\alpha}\right|_{\alpha=0} dt. \quad (19)$$

The result yields the Euler-Lagrange equations

$$0 = \frac{\partial E}{\partial u_i} - \frac{d}{dt} \frac{\partial E}{\partial \dot u_i}, \quad (20)$$

which we can now use to derive error backpropagation in the neuronal dynamics:

$$\frac{\partial E}{\partial u_i} = \frac{d}{dt} \frac{\partial E}{\partial \dot u_i}$$
$$\frac{\partial E}{\partial \breve u_i^m}\frac{\partial \breve u_i^m}{\partial u_i} + \frac{\partial E}{\partial \breve u_i^r}\frac{\partial \breve u_i^r}{\partial u_i} = \frac{d}{dt}\left[\frac{\partial E}{\partial \breve u_i^m}\frac{\partial \breve u_i^m}{\partial \dot u_i}\right] + \frac{d}{dt}\left[\frac{\partial E}{\partial \breve u_i^r}\frac{\partial \breve u_i^r}{\partial \dot u_i}\right]$$
$$\left(1 - \tau_i^m \frac{d}{dt}\right) \underbrace{\frac{\partial E}{\partial \breve u_i^m}}_{e_i} = \left(1 - \tau_i^r \frac{d}{dt}\right) \underbrace{\left(-\frac{\partial E}{\partial \breve u_i^r}\right)}_{\phi_i'(\breve u_i^r)\sum_j W_{ji} e_j}. \quad (21)$$

Next, we apply the future-discount operator to both sides of the equation to obtain the future-discounted error:

$$e_i = \tilde\varepsilon_i^{\,m} \quad \text{with} \quad \varepsilon_i = \widehat{\phi_i' \sum_k W_{ki} e_k}^{\,r}. \quad (22)$$

This is the same result we obtain from the AM. Because the future-discounted error cannot be obtained without knowledge about the future, it is not biologically plausible. As a biologically plausible approximation, we use the prospective error instead. We further approximate Eqn.
(22) towards biological plausibility by using, instead of the look-back operator, a low-pass filter that corresponds to the membrane integration of neurons. The resulting error dynamics

$$e_i = \breve\varepsilon_i^{\,m} \quad \text{with} \quad \varepsilon_i = \overline{\phi_i' \sum_k W_{ki} e_k}^{\,r} \quad (23)$$

recover the equations of GLE. The approximated error dynamics are similar to the neuron dynamics of the forward path, as they low-pass filter their input and use prospective output rates. An implementation in a neuronal circuit thus uses the same neuron model for forward and error neurons and yields a symmetric circuit, as shown in Fig. 1a. Note that we omitted the external cost here, which only affects output neurons and is assumed to be a function of the output rates, $C = C(\breve u^r(t))$. Because output neurons do not project onto other neurons, i.e. $W^{\mathrm{out}}_{ij} = 0$, their error can be written as

$$\varepsilon_{i,\mathrm{out}} = -\beta\, \overline{\frac{\partial C(\breve u^r)}{\partial \breve u^r_{i,\mathrm{out}}}}^{\,r}. \quad (24)$$

For supervised learning, a typical cost to minimize is the mean squared error $C(t) = \frac{1}{2}(r^{\mathrm{trg}} - r^{\mathrm{out}})^2$, for which the target error of the output layer is given by

$$\varepsilon_{i,\mathrm{out}} = \beta\, \overline{\phi'(\breve u^r)\,(r^{\mathrm{trg}} - \phi(\breve u^r))}^{\,r}. \quad (25)$$

Minimizing the energy and minimizing the cost
In VLE, the ultimate objective for the network is to minimize the cost $C$, of which the integrated energy $\mathcal{E}$ is but a proxy. Taking inspiration from [4], we show how to recover cost minimization in the limit of small $\beta$. To show this explicitly, we go back to Eqn. (2),

$$E(t) = \frac{1}{2} \sum_i e_i^2(t) + \beta C(t).$$

Being a sum of quadratic terms, it is easy to see that in the case of zero nudging, $\beta = 0$, the trivial solution to $\delta\mathcal{E} = 0$ is $e_i = 0\ \forall i$. By the same token, for small nudging strength $\beta$, the errors are also small, $e = O(\beta)$. Because $E$ sums up the squared errors, the energy for small $\beta$ is dominated by the cost term, $E(t) = \beta C(t) + O(\beta^2)$. Given that $\mathcal{E} = \int_{t_1}^{t_2} E(t)\, dt$, we have $\lim_{\beta\to 0} \frac{1}{\beta}\mathcal{E} = \mathcal{C}$.
Taking the total derivative with respect to a network parameter $\theta$ on both sides yields

$$\lim_{\beta\to 0} \frac{1}{\beta} \frac{d\mathcal{E}}{d\theta} = \frac{d\mathcal{C}}{d\theta}. \quad (26)$$

To see how this leads to gradient-descent cost minimization, we explicitly calculate this total derivative of the functional $\mathcal{E}[u(\theta)]$,

$$\frac{d\mathcal{E}}{d\theta} = \int_{t_1}^{t_2} dt\, \delta_u E\, \frac{du}{d\theta} + \frac{\partial \mathcal{E}}{\partial \theta} = \frac{\partial \mathcal{E}}{\partial \theta}, \quad (27)$$

where we used the Gâteaux derivative for functional differentiation, and we are considering the derivative evaluated along the optimized $u$ trajectory (so-called on-shell), for which $\delta_u E = 0$. Since the total and partial derivatives of $\mathcal{E}$ with respect to $\theta$ are equal, we get $\lim_{\beta\to 0} \frac{1}{\beta}\frac{\partial \mathcal{E}}{\partial \theta} = \frac{d\mathcal{C}}{d\theta}$. This means that minimizing the integrated energy with respect to the network parameters is equivalent to minimizing the integrated cost.

Notice that this on-shell calculation assumes compact variations, which vanish at the boundaries. This assumption can become incorrect when considering large variations of the network parameters from the beginning to the end of training. Our result is only exact within an adiabatic approximation, which treats slowly changing parameters as constant functions. This represents a further, albeit common, approximation. With this, the learning rule Eqn. (5) becomes approximately cost-gradient in the limit of small nudging,

$$\lim_{\beta\to 0} \frac{1}{\beta} \dot\theta \propto -\lim_{\beta\to 0} \frac{1}{\beta} \frac{\partial \mathcal{E}}{\partial \theta} \propto -\frac{d\mathcal{C}}{d\theta}. \quad (28)$$

In the specific case of learning the forward weights, the partial derivative $\frac{\partial \mathcal{E}}{\partial W}$ can be easily calculated, since it is a local quantity that involves the local error in the apical dendrite. This way, the learning rule Eqn.
(6) becomes approximately cost-gradient in the limit of small nudging.

Learning the backward weights
We introduce backward weights $B_{ik}$,

$$e_i = \breve\varepsilon_i^{\,m} \quad \text{with} \quad \varepsilon_i = \overline{\phi_i' \sum_k B_{ik} e_k}^{\,r}. \quad (29)$$

The backward weights can learn to mirror their reciprocal forward synapses $W_{ji}$ through specific learning rules, e.g., PAL [16]:

$$E^P_{ij} = -\xi_i^T B_{ij} \hat r_j + \frac{\alpha}{2} (B_{ij})^2 \quad (30)$$
$$\Rightarrow \dot B_{ij} \sim -\frac{\partial E^P_{ij}}{\partial B_{ij}} = \xi_i \hat r_j^T - \alpha B_{ij}, \quad (31)$$

where $\xi_i$ is a noise signal, the autocorrelation properties of which can be used to learn the correct backward weights. VLE uses this kind of mechanism to also deal with the causality (temporal) issue. By learning backward weights that compensate for the gain offset in GLE, the network can converge closer to the exact AM solution while preserving spatial and temporal locality. To see how this works, we start by defining instantaneous errors for AM, $\varepsilon_i = \phi_i'(\breve u_i^r) \sum_j W_{ji} e_j$, and VLE, $\varepsilon_i = \phi_i'(\breve u_i^r) \sum_j B_{ij} e_j$, and write down how they propagate:

$$\text{AM:}\quad e_i = \tilde\varepsilon_i^{\,m} \quad \text{with} \quad \varepsilon_i = \widehat{\phi_i' \sum_k W_{ki} e_k}^{\,r} \quad (32)$$
$$\text{VLE:}\quad e_i = \breve\varepsilon_i^{\,m} \quad \text{with} \quad \varepsilon_i = \overline{\phi_i' \sum_k B_{ik} e_k}^{\,r} \quad (33)$$

To equalize the two, we can learn a $B$ that compensates for the difference between the look-ahead and the discounted-future operators, so that $e_i^{\mathrm{AM}} \overset{!}{=} e_i^{\mathrm{VLE}}$. In fact, we can consider the even stronger constraint

$$W_{ji}\, \tilde\varepsilon_j^{\,m} \overset{!}{=} B_{ij}\, \breve\varepsilon_j^{\,m}. \quad (34)$$

To achieve this, we could simply use local gradient descent to minimize $E^B_{ij} = \frac{1}{2}\big(W_{ji}\tilde\varepsilon_j^{\,m} - B_{ij}\breve\varepsilon_j^{\,m}\big)^2$, obtaining $\dot B_{ij} = \big(W_{ji}\tilde\varepsilon_j^{\,m} - B_{ij}\breve\varepsilon_j^{\,m}\big)\breve\varepsilon_j^{\,m}$. Unfortunately, such a learning rule still uses the discounted future error $\tilde\varepsilon^{\,m}$, which is unavailable to the network in real time. A bio-plausible solution can be found by applying the look-back operator to both sides of this learning rule.
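As an aside, the noise-driven update form of Eqn. (31) can be sketched numerically. In PAL, the noise injected into a neuron reverberates through the forward path into the rates; here that echo is faked with synthetic signals, so the sketch illustrates only the fixed point of the update rule, not the full PAL mechanism:

```python
import numpy as np

# Sketch of the update form in Eqn. (31): dB/dt ~ xi * r_hat^T - alpha * B.
# The noise echo r_hat = W @ xi is a synthetic stand-in for the
# high-pass-filtered rates used by PAL; W, alpha, eta are illustrative.
rng = np.random.default_rng(1)
W = np.array([[0.7, -0.3]])        # forward weights (1 x 2)
B = np.zeros((2, 1))               # backward weights, to align with W^T
alpha, eta, dt = 1.0, 0.1, 0.1

for _ in range(20000):
    xi = rng.normal(size=(2, 1))   # noise in the two lower neurons
    r_hat = W @ xi                 # noise echo in the upper-layer rate
    B += eta * (xi @ r_hat.T - alpha * B) * dt

# Since E[xi @ (W @ xi).T] = W^T, B fluctuates around W^T / alpha.
```

The fixed point follows from the noise statistics: with unit-variance white noise, the expected update vanishes at $B = W^T/\alpha$.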
Namely, we can define the loss

$$\mathcal{E}^B_{ij} = \Big(W_{ji}\, \hat{\breve\varepsilon}_j^{\,rr} - B_{ij}\, \hat{\breve\varepsilon}_j^{\,mm}\Big)^2 \quad (35)$$

$$\Rightarrow \dot B_{ij} \sim -\frac{\partial \mathcal{E}^B_{ij}}{\partial B_{ij}} = \Big(W_{ji}\, \hat{\breve\varepsilon}_j^{\,rr} - B_{ij}\, \hat{\breve\varepsilon}_j^{\,mm}\Big)\, \hat{\breve\varepsilon}_j^{\,mm}, \quad (36)$$

which only includes quantities available online, since both look-aheads and look-backs are causal. Here, $\hat{\breve\varepsilon}_j^{\,rr}$ denotes the sequential application of the look-ahead and look-back operators with time constant $\tau_j^r$, and analogously $\hat{\breve\varepsilon}_j^{\,mm}$ with $\tau_j^m$. We used $\varepsilon_j = \hat{\breve\varepsilon}_j^{\,rr}$ to compensate for the difference between the low-pass filter in VLE and the look-back operator in AM. To see that $\mathcal{E}^B$ has the same minima as $E^B$, we simply apply the look-back operator to its argument (Eqn. (34)), which conserves the roots of $\dot B$, i.e., the minima of $E^B$. This proposal still suffers from the weight transport problem, as is made evident by the presence of $W_{ji}$ in the learning rule for $B_{ij}$. To solve this, we proceed in two steps. First, neurons learn to align to the forward weights by harnessing intrinsic neuronal noise $\xi_k$, as proposed in [16]: $\dot B^*_{ik} \sim \xi_k \hat r_k - \alpha B^*_{ik}$. Then, neurons learn to compensate the GLE gain distortion by minimizing the loss

$$\mathcal{E}^B_{ik} = \Big(B^*_{ik}\, \hat{\breve\varepsilon}_k^{\,rr} - B_{ik}\, \hat{\breve\varepsilon}_k^{\,mm}\Big)^2. \quad (37)$$

The implementation of PAL via Eqn. (37) requires the use of larger microcircuits, which have already been deployed successfully in [16], and are thus not the focus of this work. Assuming that PAL works as designed, we only use Eqn. (16) in our simulations for simplicity.

4.2 Numerical implementation

We use forward Euler integration to solve the set of coupled differential equations that define the dynamics of the network of neurons. While all computations work with vector and matrix operations, we use neuron indices here to highlight locality. The membrane voltage of a neuron integrates the bottom-up input and top-down error,

$$\Delta u_i(t) = \dot u_i(t) = \frac{1}{\tau_i^m}\Big(-u_i(t) + \sum_j W_{ij}\, r_j(t) + b_i + e_i(t)\Big).$$
(38)

We then use finite differences to approximate $\Delta u_i(t) \approx \frac{u_i(t+dt) - u_i(t)}{dt}$. The voltage at $t + dt$ can then be computed as

$$u_i(t + dt) = u_i(t) + dt\, \Delta u_i(t). \quad (39)$$

The error neurons low-pass filter the top-down error, and the derivative of their dynamics can be expressed similarly as

$$\Delta\varepsilon_i(t) = \dot\varepsilon_i(t) = \frac{1}{\tau_i^r}\Big(-\varepsilon_i(t) + \phi_i'(\breve u_i^r(t)) \sum_k W_{ki}\, e_k(t)\Big) \quad (40)$$

and the error neuron membrane potential is updated via

$$\varepsilon_i(t + dt) = \varepsilon_i(t) + dt\, \Delta\varepsilon_i(t). \quad (41)$$

The learning of the parameters follows the same scheme and, in the case of the weights, becomes

$$W_{ij}(t + dt) = W_{ij}(t) + \eta_W\, e_i(t)\, r_j(t), \quad (42)$$

with $\eta_W = \frac{dt}{\tau_W}$ defined by the learning rate for $W$. To learn the backward weights according to Eqn. (36), we need to compute the look-ahead and look-back operators. Applying both operators sequentially to the error variable, i.e. $\hat{\breve\varepsilon}_i^{\,rr} = \varepsilon_i - (\tau_i^r)^2 \ddot\varepsilon_i$, requires second-order temporal derivatives. The update rule for $B$ then becomes

$$B_{ij}(t + dt) = B_{ij}(t) + \eta_B \Big(W_{ji}\, \hat{\breve\varepsilon}_j^{\,rr}(t) - B_{ij}\, \hat{\breve\varepsilon}_j^{\,mm}(t)\Big)\, \hat{\breve\varepsilon}_j^{\,mm}(t), \quad (43)$$

with $\eta_B = \frac{dt}{\tau_B}$ again a learning rate, and

$$\hat{\breve\varepsilon}_j^{\,rr}(t) = \varepsilon_j(t) - (\tau_j^r)^2\, \frac{\varepsilon_j(t + dt) - 2\varepsilon_j(t) + \varepsilon_j(t - dt)}{dt^2}, \quad (44)$$

using the instantaneous input integrated in Eqn. (40) to obtain $\varepsilon(t + dt)$. The second-order derivative can also be expressed as a finite difference of first-order differentials, $\ddot\varepsilon_j(t) = \frac{\dot\varepsilon_j(t) - \dot\varepsilon_j(t - dt)}{dt}$. While we obtain $\dot\varepsilon_j(t)$ by rearranging Eqn. (40), our implementation needs to store one additional past value to compute the second-order derivative. We note that such workarounds for estimating temporal derivatives are a common issue when discretizing differential equations that are meant to be continuous. Algorithm 1, adapted from [1], summarizes the system of equations as simulated in our experiments.
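To make the discrete-time operators concrete, the following sketch implements the composed look-ahead/look-back of Eqn. (44) via central finite differences, checks it on a signal with a known second derivative, and applies one backward-weight update of Eqn. (43); all scalar values are illustrative:

```python
import numpy as np

# The composed look-ahead/look-back operator reduces to
# eps - tau**2 * eps_ddot; eps_ddot is estimated by central differences,
# as in Eqn. (44). All scalar values below are illustrative.
def composed(eps_prev, eps_now, eps_next, tau, dt):
    """Apply (1 - tau**2 d^2/dt^2) to a sampled signal at one point."""
    eps_ddot = (eps_next - 2.0 * eps_now + eps_prev) / dt**2
    return eps_now - tau**2 * eps_ddot

# Sanity check on eps(t) = sin(t), whose second derivative is -sin(t).
dt, tau = 1e-3, 0.1
t = 1.0
fd = composed(np.sin(t - dt), np.sin(t), np.sin(t + dt), tau, dt)
exact = np.sin(t) + tau**2 * np.sin(t)
assert abs(fd - exact) < 1e-6   # central differences are O(dt**2) accurate

# One backward-weight update as in Eqn. (43), for a single synapse.
dt, tau_r, tau_m, eta_B = 0.01, 0.1, 0.2, 0.1
eps_hist = (0.30, 0.32, 0.33)   # eps_j at t - dt, t, t + dt
W_ji, B_ij = 0.8, 0.5
e_rr = composed(*eps_hist, tau_r, dt)
e_mm = composed(*eps_hist, tau_m, dt)
B_ij = B_ij + eta_B * (W_ji * e_rr - B_ij * e_mm) * e_mm
```

Note how a slight curvature in the sampled error (here 0.30, 0.32, 0.33) is strongly amplified by the $1/dt^2$ factor, which is why the implementation must store an extra past value rather than re-deriving it.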
Algorithm 1: Forward Euler simulation of a VLE network

    initialize network parameters θ = {W, B, b, τ^m, τ^r}
    initialize network states at t = 0: u(0), ε(0), r(0), e(0)
    for time step t in [0, T] with step size dt do
        for layer ℓ from 1 to L do
            if ℓ = L then
                calculate instantaneous target error:
                    e_ℓ^inst(t) ← β φ'_ℓ(t) ∘ (r_trg(t) − r_ℓ(t))
            else
                propagate feedback error signals:
                    e_ℓ^inst(t) ← φ'_ℓ(t) ∘ B_ℓ(t) e_{ℓ+1}(t)
            end if
            sum input currents: I_ℓ(t) = W_ℓ(t) r_{ℓ−1}(t) + b_ℓ(t) + γ e_ℓ(t)
            approximate membrane potential derivatives:
                Δu_ℓ(t) ← (τ^m_ℓ(t))^{−1} ∘ (−u_ℓ(t) + I_ℓ(t))
            update membrane potentials: u_ℓ(t + dt) ← u_ℓ(t) + dt Δu_ℓ(t)
            approximate error potential derivatives:
                Δε_ℓ(t) ← (τ^r_ℓ(t))^{−1} ∘ (−ε_ℓ(t) + e_ℓ^inst(t))
            update error potentials: ε_ℓ(t + dt) ← ε_ℓ(t) + dt Δε_ℓ(t)
            update prospective errors: e_ℓ(t + dt) ← ε_ℓ(t) + τ^m_ℓ(t) ∘ Δε_ℓ(t)
            calculate prospective outputs: r_ℓ(t + dt) ← φ(u_ℓ(t) + τ^r_ℓ(t) ∘ Δu_ℓ(t))
            update synaptic weights: W_ℓ(t + dt) ← W_ℓ(t) + η_W e_ℓ(t) r^T_{ℓ−1}(t)
            update backward weights:
                ε_ℓ^{mm}(t) ← ε_ℓ(t) − (τ^m_ℓ)^2(t) (Δε_ℓ(t) − Δε_ℓ(t − dt))/dt
                ε_ℓ^{rr}(t) ← ε_ℓ(t) − (τ^r_ℓ)^2(t) (Δε_ℓ(t) − Δε_ℓ(t − dt))/dt
                B_ℓ(t + dt) ← B_ℓ(t) + η_B (W^T ε_ℓ^{rr}(t) − B ε_ℓ^{mm}(t)) ε_ℓ^{mm}(t)
            update biases: b_ℓ(t + dt) ← b_ℓ(t) + η_b e_ℓ(t)
            update membrane time constants: τ^m_ℓ(t + dt) ← τ^m_ℓ(t) − η_τ e_ℓ(t) ∘ Δu_ℓ(t)
            update prospective time constants: τ^r_ℓ(t + dt) ← τ^r_ℓ(t) + η_τ e_ℓ^inst(t) ∘ Δu_ℓ(t)
        end for
    end for

4.3 Experimental setup

Lagline
In this experiment, we set up a simple case of small-network learning to test how a student network learns the output of a teacher network with the same topology. Because of the simplicity of the setup and the choice of nonlinearities, the task has a unique solution: learning the teacher weights. The time constants of the neurons are $\tau^m_0 = 0.4$, $\tau^m_1 = 0.2$, $\tau^r = dt$, and the teacher weights are $W^*_0 = 1$ and $W^*_1 = 2$.
The weights of the student are initialized to the opposite of the teacher weights, $W_0 = -1$ and $W_1 = -2$. We investigate three scenarios: learning the backward weights $B$, using $B = W^T$, and keeping $B = -W^{*T}$ constant. The learning rates are $\eta_W = 0.01$ and $\eta_B = 0.1$, the nudging strength is $\beta = 0.5$, and the simulation time step is $dt = 0.01$. The input is a sine with frequency $f = 1$ Hz. We initialize the network for $T_{\mathrm{init}} = 10$ s, train for $T_{\mathrm{train}} = 1000$ s, and evaluate for $T_{\mathrm{test}} = 10$ s. All parameters were optimized using PyTorch's SGD optimizer and a mean squared error loss.

Lagnet
The network used in this experiment consists of one neuron in the input layer, two hidden layers with two and three neurons, and three output neurons, each with tanh activations. The input signal consists of multiple overlaid sines with $N$ frequency components $f_i \in \{0.44, 0.55, 0.77, 1.3\}$ Hz and amplitudes $a_i \in \{0.4, 0.3, 0.2, 0.2\}$. We correct for the amplitudes of the components such that the input signal becomes

$$r_0(t) = \frac{\sum_i a_i \sin(2\pi f_i t)}{\frac{1}{N}\sum_j a_j^2}.$$

The time constants of the network are $[0.3, 0.7]$ for the first layer, $[0.1, 0.5, 0.9]$ for the second layer, and $[0.2, 0.4, 0.8]$ for the last layer, with $\tau^r = dt$ for all layers. To avoid high-frequency fluctuations of the error signals, we additionally include a synaptic low-pass filter with a small time constant of $\tau^s = 0.05$ for the signal and error neurons. The hyperparameters are $\eta_W = 0.01$ and $\eta_B = 0.1$, with nudging strength $\beta = 0.5$. The performance of the trained model is evaluated on a batch of 32 test signals with 10 frequency components each, sampled uniformly in the range of the training signal between 0.44 and 1.3 Hz. The network was initialized for $T_{\mathrm{init}} = 10$ s, trained for $T_{\mathrm{train}} = 10\,000$ s, and evaluated for $T_{\mathrm{eval}} = 10$ s, with simulation time step $dt = 0.01$. All parameters were optimized using PyTorch's SGD optimizer and a mean squared error loss.
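The Lagnet input construction can be sketched as follows; the frequencies and amplitudes are taken from the setup above, while the normalization by the mean squared amplitude reflects our reading of the formula and should be treated as an assumption:

```python
import numpy as np

# Sketch of the Lagnet input: a sum of sines with the frequencies and
# amplitudes from the experimental setup, normalized by the mean squared
# amplitude (assumed normalizer).
f = np.array([0.44, 0.55, 0.77, 1.3])   # frequency components (Hz)
a = np.array([0.4, 0.3, 0.2, 0.2])      # amplitudes
t = np.arange(0.0, 10.0, 0.01)          # 10 s at dt = 0.01

norm = np.mean(a**2)                    # (1/N) * sum_j a_j^2
r0 = (a[:, None] * np.sin(2.0 * np.pi * f[:, None] * t)).sum(axis=0) / norm
```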
Temporal XOR
In this experiment, we want the network to learn a continuous and temporal version of the XOR task. The binary variant of XOR requires learning $|x_1 - x_2|$ for $x_1, x_2 \in \{0, 1\}$. We use the absolute value of the delayed input rates $r_0, r_1 \in (0, 1)$ as the target, making the task continuous and temporal. To avoid numerical errors that arise through the low-pass filter and prospective rates in the forward path, we use a teacher network to generate the target signal. The network setup used by the teacher to generate the target, and by the student to learn it, is shown with the corresponding teacher weights in Fig. 3a. While the teacher consists of two independent laglines (the diagonal weights are zero), these weights are initialized and optimized by the student, which has to adapt its parameters to learn the teacher output. The delay of the two input signals is accumulated in the lower layers, which correspond to the lagline and are linear, with time constants $\tau^m = [0.4, 0.2]$ and $\tau^r = [0.2, 0.1]$ for both layers. To compute the absolute value, the upper hidden layer and the output layer have a ReLU activation and are instantaneous, with $\tau^m = \tau^r = dt$. The inputs to the network are two sine functions with frequencies $f_0 = 0.78$ Hz and $f_1 = 1.53$ Hz. The hyperparameters $\eta_W$, $\eta_B$, and $\beta$ were optimized for the two scenarios of learning $B$ and using $B = W^T$, using the grid-search space $[10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}]$. The weights were initialized from a normal distribution, and $B$ was initialized to $W^T$. For training, the best-performing parameters are shown, which are $\eta_W = 0.1$, $\eta_B = 0.01$, and $\beta = 0.1$ for learning $B$ and $\beta = 0.01$ for using $B = W^T$. The network was initialized for $T_{\mathrm{init}} = 10$ s, trained for $T_{\mathrm{train}} = 600$ s, and evaluated for $T_{\mathrm{eval}} = 10$ s, with simulation time step $dt = 0.001$. All parameters were optimized using PyTorch's SGD optimizer and a mean squared error loss.
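The target construction $|r_0^{\mathrm{in}}(t - \Delta_0) - r_1^{\mathrm{in}}(t - \Delta_1)|$ can be sketched in a few lines; the frequencies match the setup above, while the delays and signal shapes are illustrative assumptions (the paper generates the target with a teacher network):

```python
import numpy as np

# Sketch of the temporal XOR target |r0(t - d0) - r1(t - d1)|.
# Frequencies as in the experimental setup; the delays d0, d1 are
# hypothetical (the actual target comes from a teacher network).
dt = 0.001
t = np.arange(0.0, 10.0, dt)
f0, f1 = 0.78, 1.53                     # input frequencies (Hz)
d0, d1 = 0.4, 0.1                       # hypothetical delays (s)

r0 = 0.5 * (1.0 + np.sin(2 * np.pi * f0 * t))   # rates in [0, 1]
r1 = 0.5 * (1.0 + np.sin(2 * np.pi * f1 * t))

def delayed(r, delay):
    """Shift a sampled signal by `delay` seconds (hold the first value)."""
    k = int(round(delay / dt))
    return r if k == 0 else np.concatenate([np.full(k, r[0]), r[:-k]])

target = np.abs(delayed(r0, d0) - delayed(r1, d1))
```

The absolute value produces the sharp edges mentioned above, which is what introduces the high-frequency error components that motivate learning $B$.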
Acknowledgements
We gratefully acknowledge funding from the European Union under grant agreement #101147319 (EBRAINS 2.0). We would like to express particular gratitude for the ongoing support from the Manfred Stärk Foundation (MAP).

Author Contributions
F.B., S.B., W.S., and M.A.P. conceived the core ideas and designed the project. P.H. developed and maintained the GLE codebase, which served as the basis for the VLE codebase, developed by S.B. S.B. performed the simulations. All authors contributed to the development of the theory. F.B. and S.B. wrote a first version of the manuscript. All authors contributed to the final manuscript.

References
1. Ellenberger, B. et al. Backpropagation through space, time and the brain. Nature Communications 17. issn: 2041-1723. http://dx.doi.org/10.1038/s41467-025-66666-z (Dec. 2025).
2. Haider, P. et al. Latent Equilibrium: A Unified Learning Theory for Arbitrarily Fast Computation with Arbitrarily Slow Neurons. In Advances in Neural Information Processing Systems 34 (Curran Associates, Inc., 2021), 17839–17851.
3. Brandt, S., Petrovici, M. A., Senn, W., Wilmes, K. A. & Benitez, F. Prospective and Retrospective Coding in Cortical Neurons. May 2024. arXiv: 2405.14810 [q-bio].
4. Scellier, B. & Bengio, Y. Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation. Frontiers in Computational Neuroscience 11. issn: 1662-5188. http://dx.doi.org/10.3389/fncom.2017.00024 (May 2017).
5. Senn, W. et al. A neuronal least-action principle for real-time learning in cortical circuits. eLife 12, RP89674 (2024).
6. Pineda, F. J. Generalization of Back-Propagation to Recurrent Neural Networks. Physical Review Letters 59, 2229–2232 (Nov. 1987).
7. Werbos, P. J. Backpropagation through Time: What It Does and How to Do It. Proceedings of the IEEE 78, 1550–1560. issn: 1558-2256 (Oct. 1990).
8. Williams, R. J. & Zipser, D. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks.
Neural Computation 1, 270–280. issn: 0899-7667, 1530-888X (June 1989).
9. Kelley, H. J. Gradient Theory of Optimal Flight Paths. ARS Journal 30, 947–954 (Oct. 1960).
10. Todorov, E. Optimal Control Theory (Dec. 2006).
11. Chachuat, B. Nonlinear and Dynamic Optimization: From Theory to Practice. Lecture notes, EPFL, 1–187 (2007).
12. Kelley, H. J. Method of Gradients. Mathematics in Science and Engineering 5, 205–254 (1962).
13. Boyd, S. P. & Vandenberghe, L. Convex Optimization (Cambridge University Press, 2004).
14. Luenberger, D. G., Ye, Y., et al. Linear and Nonlinear Programming (Springer, 1984).
15. Lillicrap, T. P., Cownden, D., Tweed, D. B. & Akerman, C. J. Random Synaptic Feedback Weights Support Error Backpropagation for Deep Learning. Nature Communications 7, 13276. issn: 2041-1723 (Dec. 2016).
16. Max, K. et al. Learning Efficient Backprojections across Cortical Hierarchies in Real Time. Nature Machine Intelligence 6, 619–630. issn: 2522-5839 (June 2024).
17. Akrout, M., Wilson, C., Humphreys, P., Lillicrap, T. & Tweed, D. B. Deep Learning without Weight Transport. In Advances in Neural Information Processing Systems 32 (Curran Associates, Inc., 2019).
18. Guerguiev, J., Kording, K. P. & Richards, B. A. Spike-Based Causal Inference for Weight Alignment. arXiv: 1910.01689 (2019).
19. Gierlich, T., Baumbach, A., Kungl, A. F., Max, K. & Petrovici, M. A. Weight Transport through Spike Timing for Robust Local Gradients. Mar. 2025. arXiv: 2503.02642 [q-bio].
20. Max, K., Jaras, I., Granier, A., Wilmes, K. A. & Petrovici, M. A. 'Backpropagation and the brain' realized in cortical error neuron microcircuits. bioRxiv, 2025–07 (2025).