Paper deep dive
A Miniature Brain Transformer: Thalamic Gating, Hippocampal Lateralization, Amygdaloid Salience, and Prefrontal Working Memory in Attention-Coupled Latent Memory
Hong Jeong
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/13/2026, 12:31:34 AM
Summary
The paper introduces a 'miniature brain transformer' architecture that integrates four neurobiologically inspired modules—thalamic relay, amygdaloid salience, prefrontal working-memory (PFC) buffer, and cerebellar fast-path—into an attention-coupled latent memory framework. The study reveals that inhibitory callosal coupling alone is insufficient for functional lateralization of memory banks; instead, the PFC buffer acts as a critical symmetry-breaker, enabling a phase transition that leads to robust lateralization of episodic and rule-based memory.
Entities (6)
Relation Signals (3)
Prefrontal Working-Memory Buffer → enables → Functional Lateralization
confidence 95% · Functional lateralization requires the synergy of PFC and inhibition: only when the PFC buffer is added (variant 6) does a sharp, discontinuous phase transition fire
Inhibitory Callosal Coupling → insufficientfor → Functional Lateralization
confidence 95% · inhibitory callosal coupling alone never lateralizes the banks
Cerebellar Fast-Path → accelerates → Phase Transition
confidence 90% · The cerebellar fast-path accelerates the transition by one epoch
Cypher Suggestions (2)
Identify dependencies for functional lateralization · confidence 95% · unvalidated
MATCH (m:Module)-[:ENABLES]->(p:Phenomenon {name: 'Functional Lateralization'}) RETURN m.name
Find all modules integrated into the architecture · confidence 90% · unvalidated
MATCH (a:Architecture {name: 'Miniature Brain Transformer'})-[:HAS_MODULE]->(m:Module) RETURN m.name
Abstract
Abstract: We present a miniature brain transformer architecture that extends the attention-coupled latent memory framework with four additional brain-region analogues: a thalamic relay, an amygdaloid salience module, a prefrontal working-memory (PFC) buffer, and a cerebellar fast-path, all coupled by inhibitory callosal cross-talk between lateralized hippocampal banks. We evaluate on a two-domain benchmark -- MQAR (Multi-Query Associative Recall; episodic domain) and modular arithmetic (+1 mod 10; rule-based domain) -- using a seven-variant additive ablation. The central empirical finding is a surprise: inhibitory callosal coupling alone never lateralizes the banks (variants 1-5 maintain D_sep ~ 0.25 and P_ct ~ 0.25 for all 30 epochs). Functional lateralization requires the synergy of PFC and inhibition: only when the PFC buffer is added (variant 6) does a sharp, discontinuous phase transition fire -- at epoch 11 for the PFC-only variant and epoch 10 for the full model -- collapsing P_ct from 0.25 to ~0.002 and more than doubling D_sep from 0.251 to 0.501 in a single gradient step. The PFC buffer acts as a symmetry-breaker: its slowly drifting domain context creates the initial asymmetry that the inhibitory feedback loop then amplifies irreversibly. The cerebellar fast-path accelerates the transition by one epoch (epoch 10 vs. epoch 11) with no asymptotic change, confirming its convergence-acceleration role. The result constitutes a novel, falsifiable prediction -- no lateralization without working memory context -- and a principled, neurobiologically motivated blueprint for hierarchical persistent memory in sequence models.
Tags
Links
- Source: https://arxiv.org/abs/2603.07217v1
- Canonical: https://arxiv.org/abs/2603.07217v1
PDF not stored locally. Use the link above to view on the source site.
Full Text
61,955 characters extracted from source content.
A Miniature Brain Transformer: Thalamic Gating, Hippocampal Lateralization, Amygdaloid Salience, and Prefrontal Working Memory in Attention-Coupled Latent Memory
Hong Jeong, Department of Computer Information Engineering, Inha University in Tashkent, Uzbekistan. hjeong@postech.ac.kr
Abstract. We present a miniature brain transformer architecture that extends the attention-coupled latent memory framework of Jeong [12] with four additional brain-region analogues: a thalamic relay, an amygdaloid salience module, a prefrontal working-memory (PFC) buffer, and a cerebellar fast-path, all coupled by inhibitory callosal cross-talk between lateralized hippocampal banks. We evaluate on a two-domain benchmark—MQAR (Multi-Query Associative Recall [1]; episodic domain) and modular arithmetic (+1 mod 10; rule-based domain)—using a seven-variant additive ablation. The central empirical finding is a surprise: inhibitory callosal coupling alone never lateralizes the banks (variants 1–5 maintain D_sep ≈ 0.25 and P_ct ≈ 0.25 for all 30 epochs). Functional lateralization requires the synergy of PFC and inhibition: only when the PFC buffer is added (variant 6) does a sharp, discontinuous phase transition fire—at epoch 11 for the PFC-only variant and epoch 10 for the full model—collapsing P_ct from 0.25 to ≈ 0.002 and more than doubling D_sep from 0.251 to 0.501 in a single gradient step. The PFC buffer acts as a symmetry-breaker: its slowly drifting domain context creates the initial asymmetry that the inhibitory feedback loop then amplifies irreversibly. The cerebellar fast-path accelerates the transition by one epoch (epoch 10 vs. epoch 11) with no asymptotic change, confirming its convergence-acceleration role. The result constitutes a novel, falsifiable prediction—no lateralization without working memory context—and a principled, neurobiologically motivated blueprint for hierarchical persistent memory in sequence models.
1 Introduction
Memory in biological neural systems is not a monolithic store but a distributed network of anatomically and functionally specialized regions. The hippocampus, thalamus, amygdala, prefrontal cortex, and cerebellum each make distinct, non-redundant contributions to what we experience as unified memory and cognition [27, 4, 6, 11]. Their computational roles have been studied intensively: the hippocampus performs content-addressable associative recall; the thalamus gates sensory signals before cortical storage; the amygdala tags stimuli by emotional or motivational salience; the PFC maintains task-relevant context in working memory; the cerebellum provides fast error-correcting adaptation for procedural skills. Artificial memory-augmented networks [7, 29, 30] have largely adopted a flat, uniform-access architecture: a single external memory matrix read and written through differentiable attention. In [12], Jeong showed that introducing a single neuroscientific design principle—the inhibitory nature of callosal projections between hemispheres—is sufficient to drive robust functional lateralization: separate left and right memory banks specialise entirely in episodic vs. rule-based computation with zero between-bank routing error.
[Figure 1: Standard transformer vs. Brain transformer architecture. (a) A conventional deep transformer encodes the entire relevant context in a long input sequence on every forward pass; the resulting activations are volatile and are discarded at the end of each call. (b) Our brain-inspired architecture offloads long-term associative storage into persistent, lateralized hippocampal memory banks.]
The encoder itself can therefore remain thin, processing only a short prompt at inference time, while the memory banks accumulate a "big persistent brain" that survives across forward passes. Both variants are trained end-to-end with standard supervised learning. In this paper we ask: what happens when we incorporate the remaining brain-region analogues into the same mathematical framework? Figure 1 captures the central design motivation: persistent memory banks let a thin encoder behave as a much larger brain. Specifically we add four modules, each grounding a well-studied neurological function in the A^T AVW write-back operator introduced in Jeong [12]:
1. Thalamic relay (Section 4): a gain-controlled input gate that modulates the proposal state update by the entropy of the current attention map, implementing the thalamus's role as a selective relay that amplifies salient signals and suppresses noise.
2. Amygdaloid salience (Section 6): a scalar gate computed from the L2 norm of the retrieved context that up-weights consolidation of surprising or high-magnitude inputs, mirroring the amygdala's role in affective memory tagging.
3. Prefrontal working memory (Section 7): an exponential moving average of proposal states that provides a slowly changing top-down context, capturing the PFC's role in sustained task-relevant representations.
4. Cerebellar fast-path (Section 8): a momentum-enhanced bank update that accumulates gradient direction across steps, analogising the cerebellum's inverse-model error correction for procedural learning.
We evaluate on a two-domain symbolic benchmark comprising MQAR (episodic associative recall) and modular arithmetic (rule extraction), and conduct a seven-variant additive ablation that reveals an unexpected, mechanistically informative result.
Contributions. 1. A principled decomposition of the A^T AVW operator into five neuroscientifically motivated sub-circuits (Table 1). 2.
Four new architectural modules compatible with standard backpropagation, adding ≈ 98,306 parameters (≈ 2.6%) over the inhibitory lateral base. 3. The surprising empirical finding that inhibitory callosal coupling alone does not lateralize the memory banks: all five variants without the PFC buffer remain permanently unlateralized (D_sep ≈ 0.25, P_ct ≈ 0.25) across all 30 epochs. 4. PFC-inhibition synergy as the necessary condition for lateralization: adding the PFC buffer triggers a sharp pitchfork bifurcation at epoch 11 (+pfc) / epoch 10 (full) in which D_sep jumps from 0.251 to 0.501 and P_ct collapses from 0.252 to 0.002. 5. A falsifiable mechanistic prediction: lateralization requires working-memory context to break the symmetric equilibrium; inhibitory amplification alone is insufficient.
2 Background and Related Work
This section situates the paper within three intersecting research threads. We first summarise the attention-coupled latent-memory operator that forms our mathematical substrate, then survey brain-inspired architectures that motivate our module choices, and finally review memory-augmented networks that inform our evaluation design.
2.1 Attention-Coupled Latent Memory
Our starting point is the architecture introduced in Jeong [12]. Given encoder output Z_t ∈ ℝ^{n×d}, a shared proposal state P_{t-1} ∈ ℝ^{p×d}, and lateralized memory banks L_t, R_t ∈ ℝ^{m×d}, the core update is:

B_t = γ B_{t-1} + A_b^T A_b V_b W_bb + s · A_b^T A_b V_b̄ W_b̄b,  b ∈ {l, r},  (1)

where A_b is the cross-attention map from proposal to bank b, V_b the corresponding values, W_bb the ipsilateral write matrix (b̄ denotes the opposite bank), and s ∈ {−1, 0, +1} controls callosal cross-talk. Setting s = −1 (inhibitory) was shown to achieve saturated lateralization (D_sep = ±1.00, P_ct ≈ 0).
2.2 Brain-Inspired Neural Architectures
Neuroscience-guided design has a long history in neural networks.
Convolutional networks drew on the visual cortex [15]; LSTM gates were inspired by dendritic integration [10]; predictive coding architectures [23] formalize top-down and bottom-up cortical signals. More recently, hippocampal indexing theory informed memory-indexed transformers [24], and neuromodulatory signals have been used to gate attention in meta-learning systems [19]. Our approach is distinguished by its systematic mapping of five brain regions to five sub-circuits within a single forward pass, grounded in the A^T AVW operator. Kumaran et al. [14] relate hippocampal complementary learning systems to fast vs. slow learning in deep networks, anticipating our two-timescale design. O'Reilly and McClelland [20] propose the hippocampal indexing theory, in which the hippocampus stores indices to cortical patterns rather than the patterns themselves—a role analogous to our bank keys. Botvinick et al. [2] review dopaminergic mechanisms in reinforcement learning networks, directly relevant to our discussion of missing basal ganglia and neuromodulatory modules. Predictive coding transformers [23, 8] formalize cortical prediction and error signals, anticipating our cerebellar fast-path. Neuromodulatory attention [19] uses Hebbian trace-modulated weights, which partially overlaps our Gram-matrix write-back. Our work differs in providing a principled system-level mapping (all five regions, one algorithm) rather than transplanting individual mechanisms.
2.3 Memory-Augmented Networks
Neural Turing Machines [7] introduced addressable external memory but require separate read/write heads and do not exploit Gram-matrix consolidation. Memorizing Transformers [30] cache past key-value pairs but do not partition memory by cognitive mode. Structured-state-space models (Mamba [5]) compress history into implicit states, which prevents inspectable, bank-wise specialization.
Our architecture retains explicit, persistent, bank-wise memory slots while adding the cognitive modularity absent from all prior approaches. MemGPT [21] implements an OS-like memory hierarchy with explicit main memory and disk tiers, a functional counterpart to the working-memory (PFC) and long-term storage (hippocampus) distinction we make here.
Table 1: Mapping between biological brain regions and architectural components. Each module has a direct mathematical role within the A^T AVW forward pass.
Brain region | Biological function | Architectural analogue | Mathematical role
Sensory cortex | Input encoding | Transformer encoder Z_t | Z_t ∈ ℝ^{n×d}
Thalamus | Relay & gain control | Proposal gate g_t, proposal state P_t | Gain modulation of A_p^T A_p V_p W_p
Hippocampus (L) | Episodic/associative memory | Left bank L_t | A_l^T A_l V_l W_ll
Hippocampus (R) | Semantic/rule memory | Right bank R_t | A_r^T A_r V_r W_rr
Corpus callosum | Inter-hemispheric relay | Cross-bank matrices W_lr, W_rl; sign s | s · A_b^T A_b V_b̄ W_b̄b
Amygdala | Salience / affect gate | Salience scalar s_t | Weighting of consolidation update
Prefrontal cortex | Working memory / executive | PFC buffer F_t | EMA of proposal context C_p
Cerebellum | Error correction / skill | Momentum term Δ_t | Accumulated write-direction velocity
3 Foundation: Attention as a Latent-Memory Operator
We briefly recap the base architecture to establish notation. Table 2 summarises all recurring symbols used throughout the paper.
Table 2: Notation. All symbols used in equations throughout the paper. Learned parameters are marked †; running/state buffers are marked ⋆.
Symbol | Shape | Description
Tensors and states
Z_t | n×d | Encoder output at step t (sensory cortex)
P_t | p×d | Proposal state (thalamo-cortical buffer)
L_t, R_t | m×d | Left / right hippocampal memory banks
F_t | n×d | PFC working-memory buffer (EMA context)
Δ_t^(b) | m×d | Cerebellar momentum accumulator for bank b
C_p | n×d_v | Retrieved proposal context = A_p V_p
A_p | n×p | Proposal cross-attention map
A_l, A_r | p×m | Left / right lateral attention maps
Dimensions
n | — | Input sequence length
d | — | Model dimension (d_model = 256)
d_k | — | Query / key dimension (d_k = 64)
d_v | — | Value / context dimension (d_v = 64)
p | — | Number of proposal slots (p = 32)
m | — | Number of lateral memory slots (m = 16)
Learnable weight matrices (†)
W_Q, W_K, W_V | d×d_k / d×d_v | Proposal query, key, value projections
W_p | d_v×d | Proposal write-back projection
W_Kl, W_Kr | d×d_k | Key projections for left / right banks
W_Vl, W_Vr | d×d_v | Value projections for left / right banks
W_ll, W_rr | d_v×d | Ipsilateral callosal write matrices
W_lr, W_rl | d_v×d | Contralateral callosal write matrices
W_F | d_v×d | PFC context projection
W_F→Q | d×d_k | PFC top-down query bias projection
W_F-out | d×d | PFC output residual projection
Learnable scalar parameters (†)
γ ∈ (0,1) | — | Memory decay / forgetting rate
w_g, b_g | — | Thalamic gate weight and bias
w_s | — | Amygdala salience slope
β_F ∈ (0,1) | — | PFC EMA update gate (persistence control)
α_F | — | PFC top-down modulation gain
ρ ∈ (0,1) | — | Cerebellar momentum coefficient
η_cb | — | Cerebellar fast-path learning rate
Running state buffers (⋆)
μ_s | — | Amygdala running-mean salience baseline
Fixed constants
ρ_s = 0.99 | — | Amygdala EMA momentum (fixed)
s ∈ {−1, 0, +1} | — | Callosal sign: inhibitory / off / excitatory
H_max = log p | — | Maximum attention entropy over p slots
Let Z_t ∈ ℝ^{n×d} denote the encoder output,
where n is the sequence length and d is the model dimension. Cross-attention to the previous proposal state P_{t-1} ∈ ℝ^{p×d} proceeds via:

Q_p = Z_t W_Q,  K_p = P_{t-1} W_K,  V_p = P_{t-1} W_V,  (2)
A_p = softmax(Q_p K_p^T / √d_k) ∈ ℝ^{n×p},  C_p = A_p V_p ∈ ℝ^{n×d_v}.  (3)

The proposal state is updated via the Gram write-back:

P_t = γ P_{t-1} + A_p^T A_p V_p W_p.  (4)

Here γ ∈ (0,1) is a learnable memory-decay scalar (forgetting rate); A_p^T A_p ∈ ℝ^{p×p} is the Gram matrix acting as a co-activation router; and W_p ∈ ℝ^{d_v×d} projects pooled evidence into the task-aligned subspace. Slots that jointly attend to the same input token cluster reinforce one another (Hebbian binding in latent space), while W_p shapes the pooled evidence into the task-loss-aligned subspace. This tripartite projection (observation → latent → supervised) is the architectural nucleus around which all five brain modules are attached.
4 Module 1 — Thalamic Relay: Input Gating and Gain Control
The thalamic relay is the first module added to the base hippocampal architecture. It intervenes at the proposal-state write-back, scaling the consolidation gain by a scalar gate derived from the entropy of the current attention map. High-entropy (unfocused) attention suppresses noisy writes; low-entropy (focused) attention opens the gate, allowing high-confidence encodings to consolidate fully. We first motivate this design from neuroscience and then derive the formal operator.
4.1 Biological motivation
The thalamus is not a passive relay; it regulates the gain at which cortical input is forwarded to downstream stores. Thalamo-cortical loops modulate attention and sensory salience via burst and tonic firing modes [25]: in tonic mode, the relay passes incoming signals faithfully; in burst mode, it amplifies novel or strong inputs.
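To make the base operator concrete, here is a minimal NumPy sketch of Eqs. (2)–(4). Dimensions follow Table 2; the random weights and the fixed γ = 0.9 are illustrative stand-ins for parameters that the paper learns end-to-end:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d, dk, dv = 8, 32, 256, 64, 64  # dimensions from Table 2

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Learned projections (random stand-ins here)
scale = 0.05
W_Q = rng.normal(size=(d, dk)) * scale
W_K = rng.normal(size=(d, dk)) * scale
W_V = rng.normal(size=(d, dv)) * scale
W_p = rng.normal(size=(dv, d)) * scale
gamma = 0.9  # learnable memory-decay scalar, fixed for illustration

def proposal_update(Z_t, P_prev):
    """Eqs. (2)-(4): cross-attend encoder output to proposal slots, then Gram write-back."""
    Q_p = Z_t @ W_Q                           # (n, dk) queries from encoder output
    K_p, V_p = P_prev @ W_K, P_prev @ W_V     # (p, dk), (p, dv) from proposal state
    A_p = softmax(Q_p @ K_p.T / np.sqrt(dk))  # (n, p) proposal attention map
    C_p = A_p @ V_p                           # (n, dv) retrieved context
    # Gram matrix A_p^T A_p routes co-activated slots; W_p projects into model space
    P_t = gamma * P_prev + A_p.T @ A_p @ V_p @ W_p
    return P_t, A_p, C_p

Z = rng.normal(size=(n, d))
P = rng.normal(size=(p, d))
P_t, A_p, C_p = proposal_update(Z, P)
```

Note that the write-back term A_p^T A_p V_p W_p has shape (p, d), matching the proposal state, so the update is a simple decayed accumulation.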
A key function is attentional gating: thalamo-cortical projections can sharpen or suppress sensory representations before hippocampal storage.
4.2 Formalization
We implement thalamic gain control by computing the entropy of the proposal attention map A_p ∈ ℝ^{n×p} and using it to derive a scalar gain:

H_t = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{p} (A_p)_ij log (A_p)_ij,  g_t = σ(w_g · (H_max − H_t) + b_g),  (5)

where H_max = log p is the maximum entropy over p slots, so g_t → 1 when attention is sharply focused (low entropy, burst mode) and g_t → 0 when attention is diffuse (high entropy, tonic relay mode). The proposal update becomes:

P_t = γ P_{t-1} + g_t · A_p^T A_p V_p W_p.  (6)

Intuitively: when the encoder produces a sharp, focused attentional pattern over the proposal slots, the thalamic gate amplifies that consolidation signal. Diffuse, uncertain attention produces a near-zero update, preventing noisy inputs from corrupting the proposal state.
Parameter cost. Two learnable scalars w_g, b_g; negligible.
5 Module 2 — Hippocampal Memory with Callosal Inhibition
We retain the full lateralized update from Jeong [12] unchanged, as it forms the core memory substrate. The proposal state P_t (thalamic output) routes into lateralized left and right hippocampal banks via a joint softmax over concatenated keys:

Q_lr = P_t W_Q,  K_lr = [L_{t-1} W_Kl ; R_{t-1} W_Kr] ∈ ℝ^{2m×d_k},  (7)
A_lr = softmax(Q_lr K_lr^T / √d_k) ∈ ℝ^{p×2m}.  (8)

Here W_Kl, W_Kr ∈ ℝ^{d×d_k} are separate key-projection matrices for the left and right banks, allowing each bank to expose a distinct key space.
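The entropy gate of Eq. (5) can be sketched in a few lines; w_g = 1 and b_g = 0 are illustrative defaults, not the learned values:

```python
import numpy as np

def thalamic_gate(A_p, w_g=1.0, b_g=0.0):
    """Eq. (5): entropy-derived gain, near 1 for focused attention, lower for diffuse.
    w_g, b_g are the module's two learnable scalars (fixed here for illustration)."""
    n, p = A_p.shape
    H_t = -(A_p * np.log(A_p + 1e-12)).sum() / n  # mean per-row attention entropy
    H_max = np.log(p)                              # entropy of uniform attention
    return 1.0 / (1.0 + np.exp(-(w_g * (H_max - H_t) + b_g)))  # sigmoid

p = 32
# Sharply focused attention: one dominant slot per row
focused = np.full((4, p), 1e-9)
focused[np.arange(4), [0, 5, 9, 2]] = 1.0
focused /= focused.sum(axis=1, keepdims=True)
# Maximally diffuse attention: uniform over all slots
diffuse = np.full((4, p), 1.0 / p)

g_focused = thalamic_gate(focused)
g_diffuse = thalamic_gate(diffuse)
```

With these defaults, diffuse attention sits at the sigmoid midpoint (H_t = H_max gives σ(0) = 0.5), while focused attention pushes the gate toward 1; the learned w_g, b_g would shift and sharpen this curve.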
Partitioning A_lr into a left slice A_l ∈ ℝ^{p×m} (first m columns) and a right slice A_r ∈ ℝ^{p×m} (last m columns), and with value tensors V_l = L_{t-1} W_Vl, V_r = R_{t-1} W_Vr (W_Vl, W_Vr ∈ ℝ^{d×d_v} are learned value projections), and callosal sign s ∈ {−1, 0, +1}:

L_t = γ L_{t-1} + A_l^T A_l (V_l W_ll + s · V_r W_rl),  (9)
R_t = γ R_{t-1} + A_r^T A_r (V_r W_rr + s · V_l W_lr).  (10)

The write matrices W_ll, W_rr ∈ ℝ^{d_v×d} accumulate ipsilateral (same-bank) evidence, while W_rl, W_lr ∈ ℝ^{d_v×d} transfer contralateral (cross-bank) context; their contribution is scaled by the sign s, which is −1 (inhibitory), 0 (disabled), or +1 (excitatory). Setting s = −1 (inhibitory callosal cross-talk) causes the dominant bank to actively subtract the contralateral bank's influence during consolidation, sharpening the functional boundary between episodic (L) and rule-based (R) memory.
CA3/CA1 correspondence. Within the hippocampal formation, CA3 performs pattern completion (associative recall from partial cues) while CA1 performs pattern separation (encoding new episodes with minimal overlap). In our architecture, the Gram matrix A_b^T A_b corresponds to CA3 completion: slots that are jointly activated are reinforced together. The write matrix W_bb corresponds to CA1-mediated novelty encoding: supervised gradients push the write direction toward task-relevant features, separating representations in the bank.
6 Module 3 — Amygdaloid Salience Gate
The amygdaloid gate is the third module, acting downstream of the thalamic relay and the hippocampal banks. It scales the entire consolidation step by a running-mean-normalised salience signal, ensuring that high-magnitude retrievals are preferentially consolidated while routine, low-salience inputs are written weakly into memory. Biological motivation comes first; the formal gate definition follows.
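The lateralized update of Eqs. (7)–(10) can be sketched as follows; all weight matrices are random stand-ins for learned parameters, and γ = 0.9, s = −1 are illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, d, dk, dv = 32, 16, 256, 64, 64  # dims from Table 2
gamma, s = 0.9, -1                     # decay; s = -1 is the inhibitory callosal setting

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mat(*shape):
    return rng.normal(size=shape) * 0.05  # random stand-in for a learned matrix

W_Q = mat(d, dk)
W_Kl, W_Kr = mat(d, dk), mat(d, dk)    # per-bank key projections
W_Vl, W_Vr = mat(d, dv), mat(d, dv)    # per-bank value projections
W_ll, W_rr = mat(dv, d), mat(dv, d)    # ipsilateral write matrices
W_rl, W_lr = mat(dv, d), mat(dv, d)    # contralateral write matrices

def bank_update(P_t, L_prev, R_prev):
    """Eqs. (7)-(10): joint softmax over both banks' keys, then signed write-back."""
    K_lr = np.concatenate([L_prev @ W_Kl, R_prev @ W_Kr], axis=0)  # (2m, dk)
    A_lr = softmax((P_t @ W_Q) @ K_lr.T / np.sqrt(dk))             # (p, 2m)
    A_l, A_r = A_lr[:, :m], A_lr[:, m:]                            # left/right slices
    V_l, V_r = L_prev @ W_Vl, R_prev @ W_Vr                        # (m, dv) each
    L_t = gamma * L_prev + A_l.T @ A_l @ (V_l @ W_ll + s * (V_r @ W_rl))
    R_t = gamma * R_prev + A_r.T @ A_r @ (V_r @ W_rr + s * (V_l @ W_lr))
    return L_t, R_t

P = rng.normal(size=(p, d))
L = rng.normal(size=(m, d))
R = rng.normal(size=(m, d))
L_new, R_new = bank_update(P, L, R)
```

Because the softmax is taken jointly over all 2m keys, the two banks compete for each proposal slot's attention mass; with s = −1, whichever bank wins also subtracts the loser's write-back contribution.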
6.1 Biological motivation
The amygdala modulates memory consolidation in proportion to the emotional or motivational salience of an experience [4, 18]. High-salience events trigger norepinephrine release, which strengthens hippocampal long-term potentiation; low-salience events are consolidated weakly. The result is an adaptive memory that prioritises surprising, consequential information.
6.2 Formalization
We define salience as the normalized Frobenius norm of the retrieved proposal context:

s_t = σ(w_s · (‖C_p‖_F / (n d_v) − μ_s)),  (11)

where μ_s is an exponential moving average of ‖C_p‖_F / (n d_v), updated as μ_s ← ρ_s μ_s + (1 − ρ_s) ‖C_p‖_F / (n d_v) with fixed momentum ρ_s = 0.99. The only learnable parameter is w_s ∈ ℝ; μ_s is a running state, not a parameter. The salience gate modulates both the proposal update and the bank consolidation:

P_t = γ P_{t-1} + g_t · s_t · A_p^T A_p V_p W_p,  (12)
B_t = γ B_{t-1} + s_t · (A_b^T A_b (V_b W_bb + s · V_b̄ W_b̄b)),  b ∈ {l, r}.  (13)

When input is routine (s_t ≈ 0), the memory barely updates; when input is surprising or high-magnitude (s_t ≈ 1), consolidation proceeds at full strength. This prevents the memory from being overwhelmed by repetitive low-information inputs while ensuring that genuinely novel content is reliably stored.
Connection to the MQAR task. In our benchmark, the episodic (left) domain requires exact recall of 32 arbitrary key-value pairs [1]. First exposure to a key produces a high context norm (no prior memory to retrieve), driving s_t → 1 and full consolidation. Repeated exposures to already-learned keys approach the running mean μ_s, driving s_t → 0.5 and partial consolidation. This adaptive schedule naturally mirrors spaced-repetition learning dynamics.
Parameter cost. One learnable scalar w_s ∈ ℝ; the running baseline μ_s is a state buffer and ρ_s = 0.99 is a fixed constant.
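Because the gate of Eq. (11) carries a running baseline, it is naturally expressed as a small stateful object. A sketch, with an illustrative slope w_s = 50 chosen only so that the toy inputs below separate visibly:

```python
import numpy as np

class SalienceGate:
    """Running-mean-normalized salience s_t of Eq. (11).
    w_s is the single learnable scalar; mu_s is a state buffer updated with
    fixed momentum rho_s = 0.99, as in the paper."""
    def __init__(self, w_s=50.0, rho_s=0.99):
        self.w_s, self.rho_s, self.mu_s = w_s, rho_s, 0.0

    def __call__(self, C_p):
        n, dv = C_p.shape
        norm = np.linalg.norm(C_p) / (n * dv)  # size-normalized Frobenius norm
        s_t = 1.0 / (1.0 + np.exp(-self.w_s * (norm - self.mu_s)))  # sigmoid
        # EMA baseline update: mu_s <- rho_s*mu_s + (1-rho_s)*norm
        self.mu_s = self.rho_s * self.mu_s + (1 - self.rho_s) * norm
        return s_t

gate = SalienceGate()
rng = np.random.default_rng(2)
routine = rng.normal(size=(8, 64)) * 0.01    # low-magnitude retrieved context
surprising = rng.normal(size=(8, 64)) * 1.0  # high-magnitude retrieved context
s_routine = gate(routine)
s_surprise = gate(surprising)
```

A context whose norm sits far above the running baseline drives s_t toward 1 (full consolidation); a context near the baseline sits around 0.5, matching the partial-consolidation regime described for repeated MQAR keys.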
Negligible parameter overhead.
7 Module 4 — Prefrontal Working Memory Buffer
The prefrontal buffer is the module whose presence or absence determines whether lateralization occurs (Section 11). It introduces a slowly drifting top-down context that, over approximately nine epochs, builds up sufficient asymmetry to break the symmetric equilibrium maintained by callosal inhibition alone. We describe the neuroscience basis, then formulate the EMA update and query-bias mechanism.
7.1 Biological motivation
The prefrontal cortex maintains task-relevant representations in working memory over multi-second timescales, enabling context-dependent sequencing of behaviour even in the absence of immediate sensory cues [6]. PFC neurones exhibit persistent activity: once a task rule or context is loaded into working memory, it remains active until the task is completed or the context changes. PFC top-down projections then modulate posterior cortical areas and the hippocampus, biasing recall toward task-relevant content.
7.2 Formalization
We implement the PFC buffer as a gated exponential moving average over the retrieved proposal context C_p:

F_t = (1 − β_F) F_{t-1} + β_F tanh(C_p W_F),  (14)

where β_F ∈ (0,1) is the update gate (a learned parameter controlling persistence) and W_F ∈ ℝ^{d_v×d} projects context into a d-dimensional PFC representation space (d = d_model = 256). F_t ∈ ℝ^{n×d} thus retains a slowly decaying trace of recent proposal contexts, mimicking persistent PFC activity.
Top-down modulation. The PFC buffer modulates the lateral bank queries with an additive top-down bias:

Q_lr^mod = P_t W_Q + σ(α_F) F_agg W_F→Q,  (15)

where α_F is a learnable scalar gain (passed through σ to keep the modulation non-negative), F_agg ∈ ℝ^{p×d} is F_t mean-pooled over the token dimension and expanded to proposal size, and W_F→Q ∈ ℝ^{d×d_k} projects into query space.
This biases the joint softmax A_lr toward bank slots that are consistent with the current PFC context, implementing goal-directed retrieval.
Parameter cost. Gate β_F, gain α_F, projections W_F ∈ ℝ^{d_v×d}, W_F→Q ∈ ℝ^{d×d_k}, and output projection W_F-out ∈ ℝ^{d×d}. With d = 256, d_v = 64, d_k = 64: total d_v d + d d_k + d² = 64×256 + 256×64 + 256² = 98,304 matrix parameters (≈ 2.6% overhead over the base lateral model).
8 Module 5 — Cerebellar Fast-Path for Momentum Write-Back
The cerebellar fast-path is the final module, adding a momentum accumulator to the bank write-back that carries the gradient direction across consecutive steps. Its primary measured effect is to accelerate the PFC-induced phase transition by exactly one epoch: the full model lateralises at epoch 10, whereas the +PFC-only variant requires until epoch 11. We motivate the design from cerebellar forward-model theory before presenting the formal momentum update rule.
8.1 Biological motivation
The cerebellum computes forward models that predict the sensory consequences of motor commands and generates error signals to correct deviations [11]. During sequence learning, cerebellar circuits rapidly adapt motor programs based on prediction errors, producing smooth, high-frequency adjustments that complement the slower, episode-level consolidation performed by the hippocampus. Cerebellar learning is fast and local; hippocampal learning is slow and distributed.
8.2 Formalization
We implement the cerebellar fast-path as a momentum accumulator over the bank write-back update:

Δ_t^(b) = ρ Δ_{t-1}^(b) + (1 − ρ) A_b^T A_b V_b W_bb,  (16)

where ρ ∈ (0,1) is the cerebellar momentum coefficient. The bank update (Equations 9/10) is then modified to:

B_t = γ B_{t-1} + s_t · (A_b^T A_b (V_b W_bb + s · V_b̄ W_b̄b)) + η_cb Δ_t^(b),  (17)

where η_cb is a learnable cerebellar learning rate.
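The two remaining state updates, the PFC buffer of Eqs. (14)–(15) and the cerebellar momentum of Eqs. (16)–(17), can be sketched together. The scalar values (β_F = 0.1, α_F = 0, ρ = 0.9, η_cb = 0.1) and random projections are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m, d, dk, dv = 8, 32, 16, 256, 64, 64
beta_F, alpha_F = 0.1, 0.0   # PFC update gate and top-down gain (learned in the paper)
rho, eta_cb = 0.9, 0.1       # cerebellar momentum coefficient and learning rate
W_F = rng.normal(size=(dv, d)) * 0.05   # PFC context projection
W_FQ = rng.normal(size=(d, dk)) * 0.05  # PFC top-down query bias projection (W_F->Q)

def pfc_update(F_prev, C_p):
    """Eq. (14): gated EMA of the retrieved proposal context (persistent PFC trace)."""
    return (1 - beta_F) * F_prev + beta_F * np.tanh(C_p @ W_F)

def topdown_query(P_t, F_t, W_Q):
    """Eq. (15): additive top-down bias from the mean-pooled, slot-expanded PFC buffer."""
    F_agg = np.broadcast_to(F_t.mean(axis=0, keepdims=True), (p, d))  # pool, expand to p
    gain = 1.0 / (1.0 + np.exp(-alpha_F))                             # sigma(alpha_F)
    return P_t @ W_Q + gain * (F_agg @ W_FQ)

def cerebellar_step(delta_prev, write_back):
    """Eq. (16): momentum over the bank write-back; Eq. (17) adds eta_cb * delta."""
    return rho * delta_prev + (1 - rho) * write_back

F = pfc_update(np.zeros((n, d)), rng.normal(size=(n, dv)))
Q_mod = topdown_query(rng.normal(size=(p, d)), F, rng.normal(size=(d, dk)) * 0.05)
delta = cerebellar_step(np.zeros((m, d)), rng.normal(size=(m, d)))
```

Note the two timescales: β_F makes F_t drift slowly across many steps (the symmetry-breaking context), while ρ smooths the bank write-back over consecutive steps (convergence acceleration).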
The momentum term Δ_t^(b) carries the accumulated gradient direction of the write-back: on the first encounter with a new mapping, the write-back pushes the bank state in a certain direction; on subsequent encounters, the momentum amplifies that direction, accelerating convergence analogously to Adam's first-moment estimate [13].
Relation to Adam. The cerebellar update is mathematically equivalent to the first moment of Adam with ρ as the β_1 parameter, applied at the level of the memory state rather than the model parameters. This distinguishes it from standard optimizer momentum: the cerebellar buffer accumulates evidence across sequence steps at inference time, not across gradient steps at training time.
Fast vs. slow consolidation. The hippocampal update (Equation 13) is gated by salience s_t and controlled by the long-term decay γ: it encodes which information matters (salience) and how long it is retained (γ). The cerebellar term adds a third timescale: how fast convergence occurs for repeated patterns. Together, the three timescales (salience gating, decay rate, momentum accumulation) correspond to the three stages of real memory formation: encoding, storage, and consolidation [18].
Parameter cost. Scalars ρ and η_cb; negligible.
9 Unified Forward Pass: The Miniature Brain Algorithm
The unified forward-pass algorithm consolidates all five modules into a single pass. Figure 2 illustrates the complete architecture.
[Figure 2: The miniature brain architecture. Five neuroscientifically motivated modules (encoder Z_t as sensory cortex, thalamic gate g_t, proposal state P_t, hippocampal banks L_t and R_t, salience s_t, PFC buffer F_t, cerebellar momentum Δ_t) are connected through the A^T AVW write-back operator.]
Solid arrows show the primary information flow; dashed red arrows show callosal inhibitory cross-talk; dotted blue arrows show modulatory signals. Table 3 summarises the parameter overhead of each new module.
Table 3: Complete set of learnable parameters added by the four new modules over the base lateral model. Full model total: 3,716,615 parameters.
Module | New parameters | % overhead
Thalamic gate (w_g, b_g) | 2 | <0.01%
Amygdaloid salience (w_s) | 1 | <0.01%
PFC buffer (W_F ∈ ℝ^{d_v×d}, W_F→Q ∈ ℝ^{d×d_k}, W_F-out ∈ ℝ^{d×d}, β_F, α_F) | d_v d + d d_k + d² + 2 | ≈ 2.6%
Cerebellar fast-path (ρ, η_cb) | 2 | <0.01%
Total overhead | ≈ 98,306 | ≈ 2.6%
Full model | 3,716,615 | ≈ 2.6% vs. base
10 Geometric Interpretation
The A^T AVW operator admits a clean geometric decomposition that illuminates the role of each brain module.
Tripartite projection (thalamus → hippocampus). For the proposal state: V_p →(A_p) C_p ∈ ℝ^{n×d_v} →(A_p^T) A_p^T A_p V_p ∈ ℝ^{p×d_v} →(W_p) P_t ∈ ℝ^{p×d}. The thalamic gain g_t scales the second projection: when g_t ≈ 0, the Gram matrix provides no signal (the thalamus gates out noise); when g_t ≈ 1, the full Gram matrix drives consolidation.
Salience as projection-norm weighting. The amygdaloid gate s_t rescales the entire write-back A_b^T A_b V_b W_bb by a single scalar. Geometrically, this is equivalent to modulating the length of the consolidated update vector in bank-state space: salient inputs produce longer update steps; routine inputs produce shorter ones. Because s_t is derived from ‖C_p‖_F, it measures the magnitude of retrieved context—a scalar proxy for how much bank content was activated by the current input. High recall norm implies the memory already stores something relevant; low norm implies the input is novel.
This is the inverse of standard salience (novel = low recall = high surprise), and so produces more consolidation for novel inputs, exactly as the amygdala directs.

PFC as a query-space translation. The PFC modulation Q_lr^mod = P_t W_Q + σ(α_F) F_agg W_{F→Q} translates the query origin in the joint key space. This biases the joint softmax A_lr by an additive vector in query space, shrinking the effective attention entropy for inputs consistent with the PFC context and increasing it for inconsistent inputs. Because the thalamic gate amplifies low-entropy proposal attention (Eq. 5), PFC modulation and thalamic gating compose synergistically: the PFC sharpens the query; the thalamic gain amplifies the sharpened signal.

Cerebellar momentum as write-direction alignment. The momentum term Δ_t^(b) accumulates the direction of past A_b^⊤A_bV_bW_b vectors. If the same slot receives consistent updates across steps (e.g. a persistent MQAR key–value association), the momentum term amplifies those updates, accelerating convergence. If updates are inconsistent (noise), the exponential moving average cancels them. This is equivalent to the Adam first moment [13] applied in bank-state space rather than parameter space.

Bifurcation dynamics: PFC as symmetry-breaker, inhibition as amplifier. The ablation reveals that callosal inhibitory coupling alone does not trigger a phase transition: all five variants without the PFC buffer remain permanently at the symmetric fixed point (D_sep ≈ 0.25, P_ct ≈ 0.25) across 30 epochs. The inhibitory signal L_t^inh = L_t − α(R_t W_RL) produces a zero-net perturbation when both banks hold symmetric states: the callosal term then cancels itself and the system cannot self-escape. The PFC buffer provides the missing ingredient.
Its EMA F_t accumulates a slowly drifting domain-context representation over epochs 1–9; because MQAR and arithmetic inputs produce systematically different proposal states, F_t diverges between the two domains. This divergence enters the joint softmax A_lr through the query-space translation σ(α_F) F_agg W_{F→Q}, creating a small but growing asymmetry in the cross-bank attention weights. At epoch 10 (full model) this PFC-induced asymmetry crosses the critical threshold at which the callosal inhibitory gain exceeds unity: the inhibitory loop now amplifies rather than cancels, the symmetric equilibrium destabilizes, and the system snaps into the lateralized attractor in a single update step. This is a pitchfork bifurcation [26] driven by the product of two subsystems: the PFC (slow symmetry-breaking drift) and callosal inhibition (fast amplifying feedback). The cerebellar fast-path aligns consecutive routing gradients, accelerating the PFC drift and advancing the threshold crossing by one epoch (epoch 10 vs. epoch 11 for the +PFC-only variant).

11 Experiments

We evaluate the Miniature Brain Transformer with a seven-variant controlled ablation study, progressively enabling each brain module to isolate its individual and combined contributions to lateralization and task accuracy. The section is organised as follows: §11.1 describes the benchmark tasks; §11.2 defines the evaluation metrics; §11.3–11.7 report the main results, module-by-module analysis, training dynamics, and module necessity; and §11.8 discusses scaling to natural language.

11.1 Tasks and Datasets

We evaluate on two synthetic tasks over a shared vocabulary of 512 tokens (PAD, BOS, DOMAIN_L, DOMAIN_R, SEP, QUERY, 10 digit tokens, and key/value tokens):

• Dataset_l (MQAR, left bank): Multi-Query Associative Recall [1]. Each sequence presents 32 key–value pairs in the left domain, followed by 4 query tokens; the model must retrieve the value for each queried key.
Train / val: 50k / 5k sequences.

• Dataset_r (Modular Arithmetic, right bank): arithmetic progression +1 (mod 10) in the right domain (rule extraction). Train / val: 50k / 5k sequences.

• Dataset_lr (Mixed): interleaved sequences drawn uniformly from both domains, separated by a SEP token. Train / val: 100k / 10k sequences.

All models share an identical embedding layer, 4 encoder layers, d_model = 256, 4 attention heads, m_l = m_r = 16 lateral bank slots, and p = 32 proposal slots, with d_k = d_v = 64 and γ = 0.9. Training uses AdamW (lr = 3×10^−4, weight decay 10^−2) for 30 epochs with a routing auxiliary loss L_route (weight λ = 2.0) and gradient clipping at 1.0.

We compare the following seven variants in an additive ablation:

1. Baseline: standard autoregressive Transformer (no lateral memory).
2. Lateral, inhibitory: attention-coupled lateral model with callosal inhibition (s = −1), no brain modules.
3. Lateral, excitatory: same with callosal excitation (s = +1).
4. +Thalamus: inhibitory lateral + thalamic gain gate (Section 4).
5. +Amygdala: above + amygdaloid salience gate (Section 6).
6. +PFC: above + prefrontal working-memory buffer (Section 7).
7. Full (Miniature Brain): above + cerebellar fast-path (Section 8).

11.2 Evaluation Metrics

We report the following metrics on the mixed dataset D_lr (unless noted):

• Task accuracy: per-domain next-token accuracy on D_l (MQAR) and D_r (arithmetic).
• Separation Degree D_sep: normalized bank dominance, range [−1, +1]; +1 means the left bank fully dominates on D_l, −1 on D_r.
• Cross-Talk Penalty P_ct: fraction of attention mass misrouted to the wrong bank, range [0, 0.5]; lower is better.
• Mixed-dataset loss: total cross-entropy on D_lr.

11.3 Main Results and Full Ablation

Table 4 presents the complete seven-variant additive ablation.
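To make the routing metrics concrete, the following is a minimal NumPy sketch of one plausible way to compute D_sep and P_ct from per-token bank-attention masses. The definitions here are illustrative assumptions (the paper's exact normalization, whose symmetric plateau sits near 0.25 rather than 0, is not reproduced):

```python
import numpy as np

def routing_metrics(left_mass, domain):
    """Illustrative D_sep and P_ct from per-token routing statistics.

    left_mass: values in [0, 1], attention mass assigned to the left bank.
    domain:    0 for left-domain (MQAR) tokens, 1 for right-domain tokens.
    These definitions are hypothetical; they only match the paper's ranges
    (D_sep in [-1, +1], P_ct in [0, 0.5]), not its exact normalization.
    """
    left_mass = np.asarray(left_mass, dtype=float)
    domain = np.asarray(domain)
    # Bank dominance: +1 when the left bank fully owns left-domain tokens
    # and the right bank fully owns right-domain tokens.
    d_sep = left_mass[domain == 0].mean() - left_mass[domain == 1].mean()
    # Cross-talk: average attention mass routed to the wrong bank.
    wrong = np.where(domain == 0, 1.0 - left_mass, left_mass)
    p_ct = wrong.mean()
    return d_sep, p_ct

# Perfectly lateralized routing vs. a fully symmetric 50/50 split.
d, p = routing_metrics([1.0, 1.0, 0.0, 0.0], [0, 0, 1, 1])
d_sym, p_sym = routing_metrics([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1])
```

Under these definitions, perfect lateralization gives (d, p) = (1.0, 0.0) and the symmetric split gives (0.0, 0.5), reproducing the qualitative ordering reported in the ablation.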
Variants 1–3 isolate the effect of callosal cross-talk mode; variants 4–7 add brain modules incrementally to the inhibitory lateral base. The headline finding is stark: none of the first five variants ever achieves lateralization (P_ct ≈ 0.25, D_sep ≈ 0.25), whereas variants 6 and 7 converge to P_ct ≈ 0.002 and D_sep ≈ 0.501 via a sharp phase transition.

Table 4: Seven-variant additive ablation on D_lr (mixed dataset), 30 epochs. Rows 1–3 test callosal cross-talk mode; rows 4–7 add brain modules to the inhibitory-lateral base. D_sep(L) and P_ct at epoch 30. Rows 6–7 trigger the lateralization phase transition. Arithmetic accuracy is 1.000 for all variants.

Variant | MQAR acc. (D_l) | Arith acc. (D_r) | Loss D_lr | D_sep(L) | P_ct | Trans. ep.
1. Baseline (Transformer) | 0.051 | 1.000 | 0.135 | 0.250 | 0.253 | —
2. Lateral, inhibitory | 0.052 | 1.000 | 0.134 | 0.250 | 0.253 | —
3. Lateral, excitatory | 0.048 | 1.000 | 0.137 | 0.250 | 0.253 | —
4. +Thalamus | 0.053 | 1.000 | 0.135 | 0.252 | 0.251 | —
5. +Amygdala | 0.050 | 1.000 | 0.135 | 0.251 | 0.251 | —
6. +PFC | 0.049 | 1.000 | 0.136 | 0.501 | 0.002 | 11
7. Full (Miniature Brain) | 0.047 | 1.000 | 0.136 | 0.501 | 0.002 | 10

11.4 Observed Results: Module-by-Module Analysis

Cross-talk mode (rows 1–3): no lateralization under any callosal regime. All three cross-talk conditions (no memory baseline, inhibitory callosal, and excitatory callosal) converge to an identical unlateralized equilibrium: D_sep ≈ 0.250 and P_ct ≈ 0.253 for all 30 epochs. This refutes the hypothesis that inhibitory callosal coupling alone is sufficient for functional lateralization. Excitatory coupling does not worsen specialization beyond baseline; it is simply ignored when no asymmetric context signal is present.

Thalamus (row 4) and amygdala (row 5): marginal improvements only.
Adding the thalamic gate or amygdaloid salience module produces a small but consistent reduction in P_ct from 0.253 to 0.251, with no change in D_sep. These modules tighten the proposal-state geometry without providing enough asymmetric force to destabilize the symmetric equilibrium.

PFC buffer (row 6): the symmetry-breaker. The PFC buffer is the critical module. Without it, 30 epochs of inhibitory callosal coupling produce no lateralization. With it, a sharp phase transition fires at epoch 11: D_sep jumps from 0.253 to 0.474 and P_ct collapses from 0.250 to 0.029 in a single update step, then converges to D_sep = 0.501 and P_ct = 0.002 by epoch 12. The mechanism: the PFC EMA accumulates a slow, domain-biased top-down context over epochs 1–10; when this context diverges sufficiently between MQAR and arithmetic tokens, it skews the joint softmax A_lr asymmetrically, providing the non-zero inhibitory input needed to cross the bifurcation threshold.

Cerebellar fast-path (row 7): accelerates by one epoch. Adding cerebellar momentum advances the transition from epoch 11 (+PFC) to epoch 10 (full), with no change in asymptotic D_sep or P_ct. The momentum accumulator aligns consecutive routing-gradient steps, pushing the PFC context toward the critical threshold one epoch faster. This confirms the cerebellum's purely convergence-accelerating role, exactly as biological cerebellar forward-model theory predicts.

11.5 Brain Module Contributions: Observed

Table 5 distills the module-level incremental gains. Thalamic gating (row 4) reduces P_ct marginally (0.253 → 0.251) through tighter proposal-state geometry at domain boundaries. It does not cross the lateralization threshold. Amygdaloid salience (row 5) produces the same marginal P_ct tightening as the thalamic gate, consistent with its role as a write-magnitude modulator rather than a routing stabilizer.
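The write-magnitude modulation just described can be sketched as a single scalar gate on the A^⊤AVW bank update. In this NumPy sketch the sigmoid gate parameters (w_s, b_s) and the explicit novelty inversion of ‖C_p‖_F are assumptions for illustration, not the paper's exact equations:

```python
import numpy as np

def salience_gated_writeback(bank, A_b, V_b, W_b, C_p, w_s=1.0, b_s=0.0):
    """Amygdaloid salience as a scalar gate on the bank write-back (sketch).

    s_t is derived from ||C_p||_F and inverted so that novel inputs
    (low recall norm) consolidate more; w_s and b_s are hypothetical
    gate parameters, not taken from the paper.
    """
    recall_norm = np.linalg.norm(C_p)  # ||C_p||_F: retrieved-context magnitude
    s_t = 1.0 / (1.0 + np.exp(w_s * recall_norm - b_s))  # low norm -> high s_t
    update = A_b.T @ A_b @ V_b @ W_b   # A^T A V W consolidation term
    return bank + s_t * update, s_t    # longer update steps for novel inputs

rng = np.random.default_rng(0)
bank = np.zeros((16, 8))
A_b = rng.normal(size=(4, 16))  # attention of 4 tokens over 16 bank slots
V_b = rng.normal(size=(16, 8))
W_b = np.eye(8)
_, s_novel = salience_gated_writeback(bank, A_b, V_b, W_b, C_p=np.zeros((4, 8)))
_, s_familiar = salience_gated_writeback(bank, A_b, V_b, W_b, C_p=np.ones((4, 8)))
# A novel input (zero recall norm) receives a larger gate than a familiar one.
```

Because the gate is a single scalar per step, it rescales the length of the consolidated update vector without changing its direction, matching the geometric reading in Section 10.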
PFC buffer (row 6) is the decisive module: it alone triggers full lateralization. The asymmetry accumulated in the PFC EMA over 10 epochs provides the initial condition that makes the inhibitory feedback loop productive. Without this domain-context bias, the two banks receive identical callosal signals and remain symmetric. Cerebellar fast-path (row 7) accelerates the PFC-induced transition by one epoch (epoch 10 vs. epoch 11). Asymptotic D_sep and P_ct are unchanged.

Table 5: Incremental module contributions on D_lr, 30 epochs. Rows 4–7 show the effect of adding each module to the inhibitory-lateral base. The † marks the only module that triggers lateralization.

Variant | Loss D_lr | P_ct | D_sep(L) | Trans. ep.
2. Lateral, inhibitory | 0.134 | 0.253 | 0.250 | —
4. +Thalamus | 0.135 | 0.251 | 0.252 | —
5. +Amygdala | 0.135 | 0.251 | 0.251 | —
6. +PFC† | 0.136 | 0.002 | 0.501 | ep. 11
7. Full (Miniature Brain) | 0.136 | 0.002 | 0.501 | ep. 10

11.6 Training Dynamics: The PFC–Inhibition Phase Transition

Figure 1 and the ablation data together reveal the following training dynamics (summarized in Figure 3).

Symmetric phase (epochs 1–9, all variants). All seven variants share an identical symmetric trajectory during early training: D_sep ≈ 0.25 and P_ct ≈ 0.25, with both metrics declining imperceptibly (at most 0.002 from epoch 1 to 9). The thalamus and amygdala modules produce a tiny additional tightening (P_ct → 0.251) but remain far from the lateralization threshold. Crucially, the inhibitory lateral model (variant 2) is indistinguishable from the baseline transformer and the excitatory variant throughout the symmetric phase.

Bifurcation (epoch 10 for Full; epoch 11 for +PFC). Table 6 shows the transition window. In a single gradient update the system leaves the symmetric fixed point: D_sep nearly doubles and P_ct falls by more than 96%. By the following epoch, P_ct < 0.005; thereafter both metrics are stationary.
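The two-stage mechanism, slow symmetry-breaking drift plus fast saturating feedback, can be illustrated with a toy one-dimensional pitchfork model in the spirit of [26]. This is a hypothetical sketch, not the paper's training dynamics: x stands in for the bank asymmetry, b for the PFC context bias, and the feedback coefficient mu crosses zero (the analogue of the inhibitory gain exceeding unity) only when the bias has drifted far enough:

```python
def simulate(pfc_drift, epochs=30, steps=100, dt=0.05):
    """Toy pitchfork sketch of PFC-driven symmetry breaking (hypothetical).

    x:  bank asymmetry (0 = perfectly symmetric banks).
    b:  slowly accumulating PFC domain-context bias.
    mu: effective callosal feedback coefficient; the symmetric point x = 0
        is stable while mu < 0 and destabilizes once the PFC bias drives
        mu through 0, after which the cubic term saturates the growth.
    """
    x, b, traj = 0.0, 0.0, []
    for _ in range(epochs):
        for _ in range(steps):
            b += pfc_drift / steps                 # slow symmetry-breaking drift
            mu = -0.5 + 5.0 * b                    # bias raises the feedback gain
            x += dt * (mu * x - x**3 + 0.02 * b)   # amplify, then saturate
        traj.append(x)
    return traj

no_pfc = simulate(pfc_drift=0.0)     # no bias: x stays exactly at 0 forever
with_pfc = simulate(pfc_drift=0.01)  # bias drift: snaps to a lateralized branch
```

With zero drift the system sits on the symmetric fixed point indefinitely (the callosal term cancels itself); with a small drift, x remains near zero for the first epochs and then commits sharply to one branch, mirroring the epoch-10/11 snap in Table 6 qualitatively.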
Table 6: Lateralization metrics surrounding the phase transition. Left: full model (transition at epoch 10). Right: +PFC only (transition at epoch 11).

Epoch (full) | D_sep | P_ct | Epoch (+PFC) | D_sep | P_ct
9 (pre) | 0.251 | 0.252 | 10 (pre) | 0.253 | 0.250
10 (trans.) | 0.493 | 0.010 | 11 (trans.) | 0.474 | 0.029
11 (post) | 0.499 | 0.005 | 12 (post) | 0.498 | 0.005
30 (stable) | 0.501 | 0.002 | 30 (stable) | 0.501 | 0.002

Stable lateralized phase (epochs 11–30). Post-transition, D_sep ≈ 0.501 and P_ct ≈ 0.002 for both variants 6 and 7. The training (routing) loss simultaneously jumps from ≈ −0.87 (symmetric phase) to ≈ −1.59 at the transition epoch, then continues drifting to −1.89 by epoch 30, a 118% increase in routing reward driven entirely by correct-bank assignment, not task loss.

Mechanism: PFC as symmetry-breaker, inhibition as amplifier. The inhibitory callosal weight matrices W_LR and W_RL produce a zero-net perturbation when both banks hold symmetric states. Something must break this symmetry before the inhibitory feedback loop can produce sustained lateralization. The PFC EMA provides this: over epochs 1–9 it accumulates a context vector that slowly diverges between MQAR and arithmetic tokens (captured by the top-down query modulation σ(α_F) F_agg W_{F→Q}). At epoch 10, this PFC-induced query asymmetry crosses the critical threshold at which the net callosal inhibitory signal is non-zero in expectation conditional on domain: the inhibitory loop now amplifies the asymmetry rather than cancelling it, and the pitchfork bifurcation fires. The cerebellar momentum accumulates routing gradients in the same direction across consecutive steps, effectively raising the PFC context drift rate and advancing the threshold crossing by one epoch. This is precisely the symmetry-breaking plus amplification architecture described by the saddle-node and pitchfork bifurcation theory of coupled feedback oscillators [26].

Task accuracy.
Arithmetic accuracy is 1.000 from epoch 1 across all variants (the +1 mod 10 rule is trivially extracted). MQAR accuracy plateaus at ≈ 5% across all variants, indicating that the 32-pair recall task exceeds the encoding capacity of the current p = 32 proposal slots at this vocabulary scale; this is a capacity limitation, not a routing failure.

Figure 3: Lateralization dynamics across all seven ablation variants. (a) Lateralization map: colour encodes D_sep as a function of training epoch (x-axis) and variant (y-axis). All five variants without a PFC buffer (top rows) remain at the uniform equilibrium (blue) throughout training. Only +PFC (variant 6) and Full (variant 7) break symmetry, at epochs 11 and 10 respectively (dashed white lines). (b) Bifurcation curves: D_sep over time. Dotted vertical lines mark the transition epoch; the grey band shows the unlateralized plateau (D_sep ≈ 0.25). (c) Cross-talk penalty P_ct (inverted; lower is better), the mirror of panel (b). (d) MQAR recall accuracy, confirming that task performance does not diverge across variants despite the routing bifurcation.

11.7 Individual Module Necessity: Observed

The additive ablation (Table 4) directly reveals module necessity. The thalamic gate contributes a marginal P_ct reduction (0.253 to 0.251) with no effect on D_sep; its primary role is proposal-state noise suppression, not routing. Amygdaloid salience has the same signature: minor P_ct tightening, no lateralization. The PFC buffer is strictly necessary: removing it from the full model (i.e., variant 5) collapses lateralization entirely. The cerebellar fast-path is not necessary for lateralization but is necessary for the single-epoch acceleration of the transition; removing it delays the transition from epoch 10 to epoch 11 with identical asymptotic behavior.

11.8 Scaling and Natural Language

The synthetic benchmark is deliberately small-scale to enable controlled ablation.
Extending to natural language raises two challenges: (1) domain boundaries are implicit, requiring unsupervised routing supervision; and (2) the PFC buffer may need multi-head attention over an explicit episodic transcript, rather than a simple EMA, to capture rich task-context structure. An extension along the lines of Park et al. [22], where a retrieval LLM (hippocampal analogue) and a reasoning LLM (PFC analogue) operate over natural-language session transcripts, would bridge the current architecture with language-grounded memory systems. A natural evaluation target is LoCoMo [17], which requires long-term associative recall across multi-session dialogues, the natural-language counterpart of the MQAR task evaluated here.

12 Discussion

Is the brain analogy more than metaphor? The five modules introduced here are not merely metaphorical: each has a precise mathematical role that derives directly from the functional properties of its biological counterpart. The thalamic gate implements a quantitative analogue of burst/tonic firing modes via attention entropy. The amygdaloid salience gate implements importance-weighted synaptic potentiation via context-norm gating. The PFC buffer implements persistent activity via an exponential moving average with learnable decay. The cerebellar fast-path implements error-corrective forward-model adaptation via momentum accumulation in bank-state space. The callosal inhibitory matrix implements hemisphere-level suppression via signed cross-bank subtraction. In each case, the biological function maps onto a specific linear-algebraic operation within the A^⊤AVW framework.

Complementarity and a revised module hierarchy. The ablation overturns a simple additive view of module contributions. Variants 1–5 form an equivalence class under lateralization: despite having different callosal modes and different noise/salience modules, they all remain permanently symmetric.
The phase transition requires two qualitatively different components to be present simultaneously: (i) a slow top-down context accumulator (PFC) that generates a persistent, domain-specific bias in query space; and (ii) a signed cross-bank coupling (inhibitory callosal) that can amplify any asymmetry into a sustained attractor. Neither is sufficient alone: inhibitory coupling with symmetric context produces symmetric dynamics; a PFC context without inhibition produces a slowly biased but never-committing routing signal (the joint softmax would stay near 0.5 rather than saturating). This is directly analogous to the neuroscientific picture of interhemispheric lateralization, where persistent prefrontal task representations are held to prime the callosal inhibitory signal before the circuit commits to a hemispheric assignment [6].

What is still missing. The miniature brain remains incomplete. Several important regions are not yet represented: (i) the basal ganglia, which implement action/routing selection through dopaminergic gating; the routing auxiliary loss L_route is an implicit analogue, but an explicit, differentiable action-selection circuit would more faithfully model go/no-go gating; (ii) the anterior cingulate cortex (ACC), which monitors conflict and signals when the current routing is unreliable; an entropy-based conflict monitor would complement the thalamic gain by triggering explicit re-routing when P_ct exceeds a threshold; and (iii) neuromodulatory systems (dopamine, norepinephrine, acetylcholine), which globally modulate learning rates based on reward, surprise, and attention state; connecting task utility signals to per-module learning rates (γ, ρ, η_cb, β_F) would provide a biologically grounded adaptive hyperparameter schedule.

13 Limitations and Future Work

Scale. All experiments are conducted on a synthetic symbolic benchmark at small scale.
Validation on natural-language tasks (LoCoMo [17], StoryCloze, multi-session dialogue) is necessary to confirm that the modular gains persist at natural-language input complexity and longer sequence lengths.

Routing supervision. The routing auxiliary loss requires domain labels. Unsupervised domain discovery (e.g. via mutual information between token embeddings and bank-attention argmaxes) is required for deployment where domain boundaries are implicit.

Incomplete brain. The basal ganglia, anterior cingulate cortex, and neuromodulatory systems are absent. Adding them would complete the major memory-relevant circuits. In particular, a differentiable basal ganglia module implementing go/no-go gating over the joint softmax A_lr would replace the hand-designed routing loss with an end-to-end learned routing circuit.

Interpretability. While each module has a mechanistic interpretation, empirically verifying that the modules behave as intended (e.g., that s_t is highest for genuinely novel inputs) requires additional diagnostic probing experiments.

14 Conclusion

We have extended the attention-coupled latent memory architecture of Jeong [12] into a miniature brain comprising five neuroscientifically motivated modules: a thalamic relay, lateralized hippocampal banks with callosal inhibition, an amygdaloid salience gate, a prefrontal working-memory buffer, and a cerebellar fast-path momentum accumulator. A seven-variant additive ablation across 30 training epochs yields a clear and surprising finding: inhibitory callosal coupling alone does not lateralize the memory banks. Variants 1–5, including the inhibitory lateral, the excitatory lateral, and the baseline transformer, all converge to the same unlateralized equilibrium (D_sep ≈ 0.250, P_ct ≈ 0.253).
Lateralization requires the synergy of the PFC buffer and callosal inhibition: the PFC EMA accumulates a slowly drifting domain-context bias that breaks the symmetric equilibrium after ~10 epochs, at which point the inhibitory feedback loop amplifies the asymmetry in a single gradient step (a pitchfork bifurcation), collapsing P_ct from 0.25 to 0.002 and raising D_sep from 0.25 to 0.501. The cerebellar fast-path advances the bifurcation by one epoch (epoch 10 vs. epoch 11) through routing-gradient alignment, confirming its pure convergence-acceleration role. The result constitutes a novel, falsifiable prediction: in any dual-bank memory system with signed inter-bank coupling, functional lateralization requires a persistent top-down context signal to break the symmetric equilibrium. This maps directly onto the neuroscientific account of hemispheric lateralization, where sustained prefrontal representations are hypothesized to prime callosal inhibitory circuits before a hemispheric assignment is committed [6]. The central methodological lesson is that the A^⊤AVW write-back operator serves as both a unifying substrate and a diagnostic tool: the ablation identifies not just which modules contribute but which interactions are functionally necessary, providing a template for principled architecture search guided by neuroscientific circuit theory.

Code and Data Availability

All experiment code, model weights, and result logs are released under the MIT license at https://github.com/gnoejh/paper_2026_0307_memory_transformer_arxiv.git. The repository contains the complete seven-variant ablation pipeline (experiment/main.py --run ablation), dataset generators (MQAR and modular arithmetic), metric implementations (D_sep, P_ct), figure-reproduction scripts, and saved checkpoints for all reported variants. No external datasets are used; all synthetic data are generated on the fly by experiment/dataset.py.
Acknowledgements

The author thanks Inha University in Tashkent for research support. This work reflects the author's ongoing inquiry into nature and human cognition.

References

Arora et al. [2024] Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. Zoology: Measuring and improving recall in efficient language models. In ICLR, 2024.
Botvinick et al. [2020] Matthew Botvinick, Jane Wang, Will Dabney, Kevin J. Miller, and Zeb Kurth-Nelson. Deep reinforcement learning and its neuroscientific implications. Neuron, 107(4):603–616, 2020.
Cho et al. [2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP, 2014.
Damasio [1998] Antonio R. Damasio. The Feeling of What Happens: Body and Emotion in the Making of Consciousness. Harcourt Brace, 1998.
Dao and Gu [2023] Tri Dao and Albert Gu. Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
Fuster [2008] Joaquin M. Fuster. The Prefrontal Cortex, 4th edition. Academic Press, London, 2008.
Graves et al. [2016] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476, 2016.
Hollenstein et al. [2019] Nora Hollenstein, Antonio de la Torre, Nicolas Langer, Ce Zhang, and Barbara Plank. CogniVal: a framework for cognitive NLP evaluation. In CoNLL, 2019.
Gu et al. [2022] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In ICLR, 2022.
Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.
Ito [2008] Masao Ito. Control of mental activities by internal models in the cerebellum. Nature Reviews Neuroscience, 9(4):304–313, 2008.
Jeong [2026] Hong Jeong. Inhibitory cross-talk enables functional lateralization in attention-coupled latent memory. arXiv:2603.03355 [q-bio.NC], 2026.
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In ICLR, 2015.
Kumaran et al. [2016] Dharshan Kumaran, Demis Hassabis, and James L. McClelland. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7):512–534, 2016.
LeCun et al. [1989] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
Lieber et al. [2024] Opher Lieber et al. Jamba: hybrid transformer–state space models for efficient long-context modeling. arXiv preprint, 2024.
Maharana et al. [2024] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of ACL, 2024.
McGaugh [2004] James L. McGaugh. The amygdala modulates the consolidation of memories of emotionally arousing experiences. Annual Review of Neuroscience, 27:1–28, 2004.
Miconi et al. [2018] Thomas Miconi, Jeff Clune, and Kenneth O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. In ICML, 2018.
O’Reilly and McClelland [1994] Randall C. O’Reilly and James L. McClelland. Hippocampal conjunctive encoding, storage while retrieving: avoiding a trade-off. Hippocampus, 4(6):661–682, 1994.
Packer et al. [2024] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: towards LLMs as operating systems. arXiv:2310.08560, 2024.
Park et al. [2023] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: interactive simulacra of human behavior. In UIST, 2023.
Rao and Ballard [1999] Rajesh P. N. Rao and Dana H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.
Ritter et al. [2018] Samuel Ritter, Jane X. Wang, Zeb Kurth-Nelson, Siddhant M. Jayakumar, Charles Blundell, Razvan Pascanu, and Matthew Botvinick. Been there, done that: meta-learning with episodic recall. In ICML, 2018.
Sherman and Guillery [2001] S. Murray Sherman and R. W. Guillery. Exploring the Thalamus. Academic Press, San Diego, 2001.
Strogatz [2018] Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. CRC Press, 2nd edition, 2018.
Squire and Dede [2015] Larry R. Squire and Adam J. O. Dede. Conscious and unconscious memory systems. Cold Spring Harbor Perspectives in Biology, 7(3):a021667, 2015.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
Weston et al. [2014] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv:1410.3916, 2014.
Wu et al. [2022] Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv:2203.08913, 2022.
Innocenti [1986] Giorgio M. Innocenti. General organization of callosal connections in the cerebral cortex. In A. Peters and E. G. Jones (Eds.), Cerebral Cortex, Vol. 5. Plenum Press, 1986.
Bloom and Hynd [2005] John S. Bloom and George W. Hynd. The role of the corpus callosum in interhemispheric transfer of information: excitation or inhibition? Neuropsychology Review, 15(2):59–71, 2005.