Paper deep dive
PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection
Hyoseok Park, Yeonsang Park
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/26/2026, 2:30:37 AM
Summary
PRISM is a photonic similarity engine designed to overcome the O(n) memory bandwidth bottleneck in long-context LLM inference. By utilizing a thin-film lithium niobate (TFLN) broadcast-and-weight architecture, PRISM performs O(1) similarity searches for KV cache block selection, significantly reducing memory traffic and energy consumption compared to GPU-based full-scan methods.
Entities (5)
Relation Signals (3)
PRISM → utilizes → TFLN
confidence 100% · PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine.
PRISM → accelerates → KV Cache
confidence 95% · PRISM is a photonic similarity engine that sits between the KV cache storage and the GPU’s attention compute units.
KV Cache → bottlenecks → LLM inference
confidence 95% · Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache
Cypher Suggestions (2)
Find all systems that accelerate KV cache operations · confidence 90% · unvalidated
MATCH (s:System)-[:ACCELERATES]->(m:MemoryComponent {name: 'KV Cache'}) RETURN s.name
Map the relationship between architectures and their underlying materials · confidence 85% · unvalidated
MATCH (a:Architecture)-[:UTILIZES]->(m:Material) RETURN a.name, m.name
Abstract
Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step -- a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm -- the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4–6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k = 32, with 16× traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n ≥ 4K).
Tags
Links
- Source: https://arxiv.org/abs/2603.21576v1
- Canonical: https://arxiv.org/abs/2603.21576v1
Full Text
124,529 characters extracted from source content.
† Corresponding author: yeonsang.park@cnu.ac.kr

PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Hyoseok Park, Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea
Yeonsang Park, Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea

Abstract

Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step -- a wall that no amount of arithmetic scaling can break. The semiconductor industry increasingly acknowledges this shift: NVIDIA's Vera Rubin architecture dedicates an entire DPU (ICMS) to KV cache management with flash-backed storage and hardware-assisted prefetch -- an architectural bet confirming that memory, not arithmetic, is the first-class constraint. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation. However, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm -- the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4–6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine.
Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k = 32, with 16× traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n ≥ 4K).

I Introduction

The dominant cost of large language model (LLM) inference is no longer floating-point arithmetic. As autoregressive decoding generates one token at a time, each step requires reading the full key–value (KV) cache accumulated over all previous tokens, computing attention scores, and writing the result back. For a model with L layers and H attention heads, each storing key and value vectors of dimension d_h, the KV cache occupies 2·L·H·d_h·b_prec bytes per token (b_prec = 2 bytes at half precision), growing linearly with context length n. At n = 128,000 tokens, a 70-billion-parameter model's KV cache can exceed 40 GB -- comparable to the entire model weight footprint -- and the memory bandwidth required to stream this cache at every decode step far exceeds the compute throughput of modern GPUs [10]. This memory wall is intensifying [17]. Context windows are expanding at an unprecedented pace: GPT-5.4 [38] and Claude Opus 4.6 [3] now support one million tokens, Gemini 2.5 Pro extends to one million with native multimodal processing [16], Meta's Llama 4 Scout pushes the frontier to 10 million tokens with a mixture-of-experts architecture [32], and Qwen2.5 similarly extends to one million [44]. Multi-agent and retrieval-augmented generation (RAG) workloads routinely concatenate documents into contexts of hundreds of thousands to millions of tokens. NVIDIA's response in its Vera Rubin architecture is telling: the Intelligent Connectivity and Memory Switch (ICMS), built on the BlueField-4 data processing unit (DPU), adds a flash-based KV cache tier that can hold terabytes of context, together with hardware-assisted eviction and prefetch logic [36, 35].
This architectural bet confirms that KV cache management is now a first-class system design problem. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation [20, 73, 59, 15]. In particular, Tian et al. demonstrated a photonic transformer chip (PTC) achieving over 200 POPS for full attention via coherent Mach–Zehnder meshes [59]. However, these approaches target dense matrix–vector multiplication and inherit the same O(n) memory scaling as electronic attention when applied to long contexts -- every KV entry must still be transferred to the photonic chip. Electronic block-selection methods (Quest [58], RocketKV [6]) reduce the number of blocks fetched, but the selection scan itself -- computing similarity scores across all N candidates -- still traverses the memory hierarchy at O(N) cost. Neither photonic dense computation nor electronic sparse selection breaks the linear scaling. The fundamental bottleneck is therefore not attention computation, but the selection scan itself: the O(N) memory read required to evaluate all candidate blocks. Eliminating this scan -- rather than merely reducing the number of blocks fetched -- changes the scaling from O(N) to O(k), independent of the total number of cached blocks. We observe that the coarse block-selection step is a memory-bound similarity search that determines which of N ~ 10^3–10^4 KV blocks to fetch. This search has three properties that are structurally matched to the photonic broadcast-and-weight (B&W) paradigm [56, 54]: (i) the query fans out identically to all candidates, mapping directly to passive optical splitting; (ii) stored block signatures are quasi-static (updated every 64–512 tokens), matching fast MRR electro-optic programming (Pockels effect); and (iii) only rank order matters, relaxing precision to 4–6 bits.
Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation latency remains O(1) (though the hardware footprint scales as O(d × N) MRRs). In the B&W architecture, a WDM laser comb encodes a d-dimensional query onto d co-propagating wavelengths; a 1×N passive splitter broadcasts identical copies to N channels; a bank of microring resonators (MRRs) on thin-film lithium niobate (TFLN) applies programmable weights at each channel; and a broadband photodetector integrates over all wavelengths, yielding the analog inner product in O(1) optical latency. We therefore propose scan-free block selection -- eliminating the O(N) memory read entirely -- and instantiate this principle using a photonic broadcast-and-weight engine, where the natural broadcast property of optics makes scan-free selection physically native to the hardware. The resulting system is not a general neural-network accelerator, but a specialized similarity engine for the memory-bound block-selection task. A crucial observation simplifies the problem: not all attention heads actually need the full cache. Recent work on retrieval heads [64, 66, 57] has shown that attention heads split into two categories: retrieval heads that attend to tokens far from the current position, and streaming heads that attend primarily to nearby tokens and "attention sinks." The fraction classified as retrieval heads is threshold-dependent: DuoAttention identifies approximately 25% of heads as retrieval heads in MHA models and approximately 50% in GQA models via learned gating optimization [66], while our profiling on Qwen2.5-7B finds over 90% at a relaxed threshold (τ = 0.3; Sec. V). This discrepancy reflects differing identification criteria rather than a contradiction -- the key insight is that only the retrieval subset requires distant block fetches.
This asymmetry has motivated a family of block-level selection methods that implement a coarse candidate-selection step followed by fine attention over only the selected blocks on the GPU [58, 6, 65, 30, 26]. Complementary strategies include token-level eviction (H2O [71], StreamingLLM [67, 18]), two-stage coarse–fine retrieval (RocketKV [6]), and hardware-assisted caching (NVIDIA ICMS [35]). All electronic approaches share a common limitation: the coarse selection step itself consumes memory bandwidth proportional to the number of stored blocks. Recent analysis confirms that this block-selection phase can consume the majority of total KV retrieval latency [29]. A photonic inner-product engine can break this scaling by performing all N similarity evaluations in parallel, using wavelength multiplexing to avoid the sequential memory access pattern entirely. We propose PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a TFLN photonic similarity engine that realizes the photonic broadcast search concept for KV cache block selection. PRISM encodes the query sketch onto d WDM wavelength channels, broadcasts it to N parallel MRR weight-bank channels via a 1×N optical splitter, and computes all N similarity scores -- each as an analog optical dot product I_n ∝ Σ_j w_{n,j} s_j -- in O(1) optical latency. A compact electronic top-k comparator selects the highest-scoring block indices, and only the corresponding KV blocks are fetched from memory. Figure 1 contrasts the conventional electronic full-scan approach with the PRISM photonic block-selection pipeline. Figure 1: Conceptual comparison of KV cache access strategies. Left: GPU–HBM full scan -- the processor sequentially reads all N KV blocks from HBM to compute attention, bottlenecked by memory bandwidth.
Right: PRISM selective fetch -- the query is broadcast optically to all N signature channels in parallel; only the top-k highest-scoring blocks are fetched from memory, reducing traffic by a factor of N/k. Our contributions are as follows: 1. Structural match and photonic broadcast search architecture. We identify a structural match between the KV cache block-selection task and the photonic broadcast-and-weight paradigm: the query fans out identically to all candidates (passive optical splitting), stored signatures are quasi-static (fast MRR electro-optic programming), and only rank order matters (relaxed precision). We propose and analyze a photonic similarity engine that exploits this match, present a complete optical power budget analysis covering splitting loss, MRR insertion loss, and photodetector noise floors, and derive the signal-to-noise ratio (SNR) requirements for reliable top-k ranking (Secs. III and IV). 2. Hardware-aware impairment modeling and NIAH validation. We build a device-level impairment model incorporating weight quantization (4–8 bits), residual thermal drift, insertion-loss chains, photodetector noise, and MRR crosstalk, and show that recall degrades by less than 10% under realistic conditions. End-to-end needle-in-a-haystack (NIAH) evaluation with Qwen2.5-7B demonstrates that MRR-selected block-sparse attention matches full-attention accuracy at context lengths from 4K to 64K tokens (within the model's native context window), while replacing the electronic selection with photonic O(1)-latency computation. Beyond 64K, model-intrinsic accuracy degrades independent of block selection (Sec. IV). 3. Photonic scaling analysis. We derive energy and latency models for PRISM and electronic baselines (GPU full scan, GPU ANN, NVIDIA ICMS), identifying the context-length crossover point above which PRISM is favorable, and analyze how the photonic architecture scales to million-token contexts (Sec. VI). 4.
Double-scaling advantage and retrieval head analysis. We systematically profile retrieval-head ratios across Qwen2.5-7B and Qwen3-8B, confirming that over 90% of KV heads are retrieval heads (at threshold τ = 0.3), and reveal a double-scaling advantage: as context grows, both the per-head benefit of O(1) photonic computation and the fraction of heads requiring acceleration increase simultaneously. We evaluate block-level signatures, demonstrating that mean-key projection achieves 77.3% recall@8 with d = 32 (Sec. V).

II Background

II.1 KV Cache in Transformer Inference

The core of modern LLMs is the multi-head self-attention mechanism [62]. Given an input sequence of n tokens embedded as X ∈ R^{n × d_model}, each attention head h in layer ℓ projects the input into queries, keys, and values:

Q^(ℓ,h) = X W_Q^(ℓ,h),  K^(ℓ,h) = X W_K^(ℓ,h),  V^(ℓ,h) = X W_V^(ℓ,h),  (1)

where W_Q, W_K, W_V ∈ R^{d_model × d_h} and d_h = d_model / H is the per-head dimension. The attention output is computed as

Attn(Q, K, V) = softmax(Q K^T / √d_h) V.  (2)

During the autoregressive decode phase, the model generates one token at a time. At step t, only the new query vector q_t ∈ R^{d_h} is computed, but the attention score requires the inner product of q_t with all t previously cached key vectors:

α_{t,i} = q_t · k_i / √d_h,  i = 1, …, t.  (3)

The KV cache stores K^(ℓ,h) and V^(ℓ,h) for all layers and heads, consuming memory

M_KV = 2 · L · H_KV · d_h · n · b_prec,  (4)

where H_KV is the number of KV heads (which equals H for multi-head attention but is reduced under grouped-query attention, GQA [2, 49]) and b_prec is the byte width per element (2 for BF16). For Llama-3.1-8B (L = 32, H_KV = 8 with 4-group GQA, d_h = 128) at n = 128,000, eq. (4) gives M_KV ≈ 16 GB, which already consumes a substantial fraction of GPU HBM and grows linearly with n.
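Eq. (4) is easy to check numerically. A minimal sketch (the function name is ours; constants are the Llama-3.1-8B example from the text):

```python
def kv_cache_bytes(L, H_kv, d_h, n, b_prec=2):
    """Eq. (4): KV cache footprint in bytes.

    Factor 2 covers keys + values; b_prec = 2 bytes for BF16.
    """
    return 2 * L * H_kv * d_h * n * b_prec

# Llama-3.1-8B (L=32, H_KV=8, d_h=128) at a 128K-token context
m = kv_cache_bytes(L=32, H_kv=8, d_h=128, n=128_000)
print(m / 1e9)  # ≈ 16.8 GB, matching the ~16 GB figure above
```

The same formula at L = 80, H_KV = 8 reproduces the ~40 GB figure quoted for 70B-class models in the Introduction.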
Crucially, the decode phase is memory-bandwidth-bound: each generated token requires reading the entire KV cache but performs only O(n · d_h) multiply-accumulate operations per head. The arithmetic intensity (FLOPs per byte) is 1/(2 d_h) ≪ 1, far below the compute-to-bandwidth ratio of modern GPUs (50–200 FLOP/B), leaving the compute units idle while waiting for data [10].

II.2 Retrieval Heads and Selective Attention

The observation that not all attention heads require the full KV cache was formalized by DuoAttention [66] and RazorAttention [57]. These works define a retrieval ratio R_h^(ℓ,h) for each head as the fraction of attention mass that falls outside a local window of size w:

R_h^(ℓ,h) = 1 − (1/T) Σ_{t=1}^{T} Σ_{i=max(1, t−w)}^{t} α_{t,i}^(ℓ,h),  (5)

where α_{t,i}^(ℓ,h) is the attention weight from eq. (3) and T is the total sequence length of a calibration corpus. Heads with R_h > τ (typically τ ≈ 0.1) are classified as retrieval heads; the rest are streaming heads. Empirically, DuoAttention identifies approximately 25% (MHA) to 50% (GQA) of heads as retrieval heads via learned gating optimization [66]. Streaming heads can be served with a small sliding-window cache (e.g., w = 256), drastically reducing their memory footprint. However, retrieval heads still require access to the full context, making their KV traffic the dominant bottleneck.

II.3 Photonic Similarity Engine

As noted in Sec. I, the coarse block-selection step is a similarity search whose properties -- identical query fan-out, quasi-static weights, and rank-order-only output -- make it a natural fit for photonic broadcast-and-weight hardware. We now review the key photonic concepts underlying this match. Broadcast-and-weight architecture. Tait et al. [56, 54] introduced the broadcast-and-weight (B&W) paradigm for neuromorphic photonic networks.
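The retrieval-ratio definition of eq. (5) above can be exercised on a toy attention pattern. A minimal sketch (our own function and toy weights, not from the paper; rows are assumed to be normalized attention distributions):

```python
def retrieval_ratio(attn, w):
    """Eq. (5): fraction of attention mass outside a local window of size w.

    attn[t-1][i-1] = attention weight alpha_{t,i} from decode step t
    to cached position i <= t; each row sums to 1.
    """
    T = len(attn)
    local = 0.0
    for t in range(1, T + 1):
        row = attn[t - 1]
        lo = max(1, t - w)                      # window [max(1, t-w), t]
        local += sum(row[i - 1] for i in range(lo, t + 1))
    return 1.0 - local / T

# A head that keeps ~90% of its mass on the oldest token: a "retrieval" pattern
attn = [[1.0], [0.9, 0.1], [0.9, 0.0, 0.1], [0.9, 0.0, 0.0, 0.1]]
print(round(retrieval_ratio(attn, w=1), 2))  # → 0.45, well above tau ≈ 0.1
```

A streaming head (mass concentrated inside the window) would score near zero under the same metric.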
In this architecture, d input signals are encoded on distinct wavelengths λ_1, …, λ_d and broadcast via a 1×N optical splitter to N output channels. Each output channel contains d microring resonators (MRRs), each tuned to one wavelength, whose transmission coefficients serve as programmable weights w_{n,j} for channel n and wavelength j. A wavelength-insensitive photodetector at each output integrates over all wavelengths, yielding the photocurrent

I_n = R · P_0 · Σ_{j=1}^{d} w_{n,j} s_j,  (6)

where R is the detector responsivity, P_0 the per-channel optical power after splitting, and s_j the query signal on wavelength λ_j. The photocurrent I_n is thus proportional to the inner product w_n · s -- precisely the similarity score between stored signature n and the broadcast query. This operation completes in a single optical transit time (~10 ps per mm), independent of d and N (up to splitting-loss limits). WDM spectral encoding. The query vector is encoded in the spectral domain: each component s_j modulates the optical power on a dedicated wavelength channel λ_j, so the full d-dimensional vector propagates as a single multi-wavelength beam in one waveguide. This spectral encoding is distinct from spatial encoding, where each component occupies a separate waveguide, because it enables the key broadcast step -- splitting one waveguide into N copies -- with no additional multiplexing hardware. Channel spacings of 0.8–1.6 nm within the C-band support d = 32–128 channels using standard dense WDM (DWDM) laser combs and MRR filter banks. Comparison with other photonic paradigms. Alternative photonic architectures -- MZI meshes [46, 9, 51, 15] and coherent processors -- require O(d^2) elements or global phase stability, and do not naturally support the one-to-many fan-out needed for similarity search.
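The inner product of eq. (6) above can be emulated in a few lines. A minimal sketch (our own illustration with unsigned weights w ∈ [0, 1]; names and values are ours):

```python
def photocurrent(weights, s, R=1.0, P0=1.0):
    """Eq. (6): photocurrent of one channel, I_n = R * P0 * sum_j w_{n,j} s_j.

    weights: MRR transmissions for this channel; s: query powers on λ_1..λ_d.
    """
    return R * P0 * sum(w * x for w, x in zip(weights, s))

# One broadcast query scored against all N stored signatures; in hardware
# every channel produces its I_n in a single optical transit, in parallel.
signatures = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.3]]
query = [0.9, 0.1, 0.0]
scores = [photocurrent(w, query) for w in signatures]
print(scores.index(max(scores)))  # → 1: channel 1 best matches the query
```

The O(1) claim is about latency, not work: all N of these sums happen simultaneously in the optical domain.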
The broadcast-and-weight paradigm uses incoherent intensity-domain processing, where each MRR operates independently and the photodetector sums power rather than field amplitude, eliminating the need for global phase coherence and making it uniquely suited to the block-selection task. MRR weight banks. Each output channel employs d microring resonators whose electro-optically tunable transmission implements programmable weights w ∈ [0, 1]. The MRR physics and TFLN-specific device parameters are detailed in Secs. III.4 and III.6. WDM-based matrix–vector multiplication. Scalable MRR weight banks with up to 16 wavelength channels and ~7-bit precision have been demonstrated [72, 21], building on early silicon MRR electro-optic modulators [68], and recent large-scale photonic accelerators validate integration beyond 16,000 components [20, 73, 15, 69]. The key advantage of this photonic approach for the KV cache selection problem is that the "weight matrix" -- the collection of block signatures -- is quasi-static and can be programmed into MRR resonances via electro-optic tuning, while the "input vector" -- the query sketch -- changes at every decode step but is broadcast optically to all N channels simultaneously. This decoupling of the weight-programming rate from the inference rate is what enables the O(1) optical latency scaling that electronic approaches cannot match. While Lightening-Transformer [73] targets full attention computation, PRISM takes a complementary approach: accelerating only the lightweight block-selection ranking task, which requires lower precision and fewer channels, making the photonic implementation more practical.

III Photonic Retrieval Architecture

III.1 System Overview

PRISM is a photonic similarity engine that sits between the KV cache storage (HBM or flash-backed ICMS) and the GPU's attention compute units.
It does not replace any part of the GPU pipeline; rather, it acts as a photonic broadcast search module that determines which KV cache blocks should be fetched for each retrieval head at each decode step. Figure 2: PRISM system architecture (five-stage pipeline). Stage 1 (Query Encoding): the GPU/ASIC computes the query sketch q = [q_1, …, q_d] and encodes each component onto a WDM wavelength via DAC-driven modulators, producing a WDM query signal with P(λ_j) = q_j. Stage 2 (Broadcast): a 1×N optical splitter distributes identical copies of the d-wavelength signal to all N signature channels (splitting loss: 10 log10 N dB). Stage 3 (Signature Weighting): each channel passes through a row of d MRRs on the TFLN photonic chip; the transmission t_ij = s_ij of each MRR is electro-optically programmed via DC bias electrodes to encode the block-signature weight, performing the wavelength-selective multiplication P_out(λ_j) = q_j × s_ij. Stage 4 (Summation): broadband photodetectors integrate all wavelengths, yielding photocurrents I_i = R · Σ_j (q_j · s_ij) that are proportional to the inner product q · s_i. Stage 5 (Top-k Selection): ADCs digitize the N photocurrents, a digital top-k selector identifies the k highest-scoring block indices, and a memory controller fetches only those KV blocks from HBM/flash storage. The system operates as a five-stage pipeline, illustrated in fig. 2. For each retrieval head at each decode step: 1. The GPU computes the query vector q_t and applies the signature projection to obtain a d-dimensional query sketch s_q. 2. s_q is converted to the optical domain and broadcast. 3. The photonic weight bank computes N inner products in parallel. 4. Photodetectors produce N analog similarity scores. 5. A digital top-k selector identifies the best blocks, and only those blocks are fetched from KV cache storage.
The GPU then computes exact attention over the selected blocks plus the local sliding window, producing the final attention output.

III.2 Signature Encoding

The performance of PRISM depends critically on the quality of the block-level signatures programmed into the MRR weight banks. Since signature encoding defines the input interface between the digital LLM pipeline and the photonic engine, we describe it first. We consider four signature construction methods. Mean key. The simplest approach averages the key vectors within each block:

σ_n = (1/B) Σ_{i ∈ block_n} k_i^(ℓ,h).  (7)

This preserves the original key-space geometry but requires d_h-dimensional signatures (e.g., d_h = 128), demanding a correspondingly large number of MRRs per channel. PCA projection. Principal component analysis over the key distribution yields a projection matrix P ∈ R^{d × d_h} (d ≪ d_h) that captures the dominant variance directions. The signature becomes σ_n = P k̄_n, reducing the MRR count per channel from d_h to d. Random projection. The Johnson–Lindenstrauss (JL) lemma guarantees that a random Gaussian matrix R ∈ R^{d × d_h} with d = O(ε^−2 log N) preserves pairwise distances (and hence inner-product rankings) to within a factor 1 ± ε with high probability [24]. The query sketch is computed identically: s_q = R q_t. Random projection is attractive because it requires no training and provides worst-case guarantees. Learned projection. A trainable linear layer W_proj ∈ R^{d × d_h} is optimized end-to-end to maximize recall@k on a calibration set. This can outperform random projections when the key distribution has exploitable structure, but requires per-model training. Balanced photodetection. The add-drop MRR configuration provides both through-port and drop-port outputs simultaneously.
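The random-projection variant described above is simple enough to sketch without any ML stack. A minimal illustration (our own code; d = 32 and d_h = 128 are the dimensions used elsewhere in the text, the scaling 1/√d is a standard JL convention so that expected inner products are preserved):

```python
import math
import random

def gaussian_projection(d, d_h, seed=0):
    """A random JL matrix R in R^{d x d_h}; entries ~ N(0, 1/d), so that
    E[<Rx, Ry>] = <x, y> and inner-product rankings survive w.h.p."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d_h)]
            for _ in range(d)]

def sketch(R, x):
    """Project a d_h-dim key/query to a d-dim signature: s = R x."""
    return [sum(r_j * x_j for r_j, x_j in zip(row, x)) for row in R]

R = gaussian_projection(d=32, d_h=128)
q = [1.0] * 128
print(len(sketch(R, q)))  # → 32 MRR weights per channel instead of 128
```

The same matrix is applied to both block signatures (programmed into the MRRs) and the per-step query, as the text notes with s_q = R q_t.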
A balanced photodetector pair measures the differential photocurrent I_bal = I_through − I_drop, yielding a signed weight w_{n,j} = T_through(λ_j) − T_drop(λ_j) ∈ [−1, +1]. On resonance (minimum through-port transmission), w ≈ −1; fully detuned, w ≈ +1. This eliminates the need for split encoding or ReLU projection, enabling direct signed inner products with d MRRs per channel (half the count of split encoding) while preserving full sign information.

III.3 WDM Query Broadcast

The d-dimensional query sketch s_q = [s_1, s_2, …, s_d] is converted from the digital domain by d digital-to-analog converters (DACs), each driving a Mach–Zehnder modulator (MZM) [63] that impresses the value s_i onto wavelength λ_i from a WDM laser comb source. The modulated signals are multiplexed into a single waveguide carrying d wavelength-encoded values [48, 61]. The DAC resolution requirement is modest: since the task is ranking rather than exact computation, 4–6 bits of input precision suffice (Sec. IV.1). This relaxation is critical because high-resolution, high-speed DACs are a major energy cost in photonic accelerators. At 4-bit resolution, a DAC operating at 1 GSa/s consumes approximately 0.5 mW per channel. Signed query encoding. The WDM intensity encoding P(λ_j) = s_j is inherently non-negative. To encode signed query sketches, we apply an offset-bias transformation: s̃_j = s_j − s_min, where s_min = min_j s_j. The shifted values s̃_j ≥ 0 are directly encoded as optical powers. At each photodetector, the output becomes

I_n = R Σ_j w_{n,j} s̃_j P_0 = R Σ_j w_{n,j} (s_j − s_min) P_0 = R P_0 [Σ_j w_{n,j} s_j − s_min Σ_j w_{n,j}].
The bias term s_min Σ_j w_{n,j} is identical for all channels sharing the same weight sum and can be digitally subtracted after ADC conversion, or -- if only rank ordering matters -- simply ignored when Σ_j w_{n,j} is approximately constant across blocks (which holds for normalized signatures). This offset-bias approach requires no additional optical hardware and preserves the O(1) broadcast property. The multiplexed d-wavelength signal is then split into N copies by a 1×N optical splitter tree. Each copy carries the full query sketch, attenuated by the splitting loss:

L_split = 10 log10(N) + α_excess ⌈log2 N⌉ [dB],  (8)

where α_excess ≈ 0.2 dB per stage for optimized 1×2 directional couplers. For N = 1024 blocks, the total splitting loss is approximately 32 dB, requiring a laser source power of 10–20 dBm to maintain an adequate signal-to-noise ratio (SNR) at the photodetectors. To manage loss, the N channels can be organized into N_bank independent banks, each serving N/N_bank channels with a separate splitter tree. This reduces per-bank splitting loss at the cost of additional laser sources or optical amplifiers. The key point is that the broadcast is passive (no per-channel energy beyond splitting loss): the same query vector reaches all N channels simultaneously, with no per-channel memory access or data movement.

III.4 MRR Weight Bank Similarity Engine

Each of the N output channels contains a linear array of d MRRs, one per wavelength channel. The j-th MRR in channel n is electro-optically tuned so that its transmission at wavelength λ_j encodes the signature weight w_{n,j}:

P_out,n(λ_j) = w_{n,j} · P_in(λ_j),  w_{n,j} = T_through,n(λ_j) − T_drop,n(λ_j) ∈ [−1, +1].  (9)

The total number of MRRs in the system is d × N.
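The splitter-tree loss of eq. (8) above reproduces the quoted ~32 dB figure directly. A minimal sketch (function name is ours):

```python
import math

def splitting_loss_db(N, alpha_excess=0.2):
    """Eq. (8): loss of a 1xN splitter tree of cascaded 1x2 couplers.

    10*log10(N) is the unavoidable power division; alpha_excess (dB) is
    the excess loss of each of the ceil(log2 N) tree stages.
    """
    return 10 * math.log10(N) + alpha_excess * math.ceil(math.log2(N))

print(round(splitting_loss_db(1024), 1))  # → 32.1 dB, the ~32 dB quoted for N = 1024
```

Each doubling of N adds ~3.2 dB, which is why the text resorts to banked splitter trees at large N.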
For d = 64 and N = 1024, this yields 65,536 MRRs -- a large but feasible integration scale for current photonic platforms [21]. Weight programming occurs at the block-completion rate. When a new KV cache block of B tokens is completed, the corresponding column of MRR weights is updated via electro-optic (Pockels) tuning with sub-nanosecond response time. During steady-state decoding, the weight bank is static and the only dynamic signal is the broadcast query sketch. Because TFLN EO tuning is capacitive, the MRR weight bank consumes near-zero static power -- only switching energy (~5 fJ per weight update) is required. This 5 fJ figure refers to the MRR electrode charging energy alone; the total switching energy including CMOS driver circuits is estimated at 50–500 fJ. Each channel terminates in a broadband (wavelength-insensitive) photodetector [28] that integrates the optical power across all d wavelengths:

I_n = R Σ_{j=1}^{d} [T_through,n(λ_j) − T_drop,n(λ_j)] s_j P_0 = R Σ_{j=1}^{d} w_{n,j} s_j P_0,  (10)

where w_{n,j} ∈ [−1, +1]. This is precisely an analog optical dot product: the photocurrent I_n ∝ Σ_{j=1}^{d} w_{n,j} s_j computes the similarity score w_n · s_q between stored block signature n and the broadcast query [42], with no explicit multiply-accumulate circuit. The physics of broadband photodetection inherently performs the summation -- no electronic accumulator is needed [13].

III.5 Electronic Top-k Interface

The N photocurrents are converted to digital values by N ADCs and fed to a digital top-k comparator network. The comparator identifies the k channels with the largest similarity scores and outputs their indices. For k ≪ N, a partial-sort network suffices, with complexity O(N log k) and latency of a few nanoseconds at a 1 GHz clock. The ADC resolution can be as low as 4–6 bits, since only the rank ordering matters.
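The partial sort described above maps onto a bounded-heap selection in software. A minimal sketch of the digital top-k stage (our own code, standing in for the hardware comparator network):

```python
import heapq

def top_k_indices(scores, k):
    """Select the k highest-scoring block indices in O(N log k),
    mirroring the partial-sort comparator that follows the ADCs."""
    return [i for _, i in heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))]

# N digitized photocurrents in, k block indices out; only these blocks
# are then fetched from HBM/flash for exact attention.
photocurrents = [0.2, 0.9, 0.1, 0.7, 0.4]
print(top_k_indices(photocurrents, k=2))  # → [1, 3]
```

Because only the rank order of the photocurrents matters, 4–6 bit ADC codes feed this stage unchanged.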
III.6 Device Parameters

Figure 3: PRISM photonic chip layout for an 8×8 configuration (d = 8 WDM channels, N = 8 signature rows). Left: the WDM query input (λ_1–λ_8) enters and is split by cascaded 1×2 Y-junctions. Center: each row contains d MRRs coupled to a bus waveguide with a coupling gap of ~200–300 nm; EO DC bias electrodes program the MRR resonances to encode signature weights via the Pockels effect. Right: through-port and drop-port outputs route to balanced Ge-on-Si PD pairs (or optionally on-chip integrated photodetectors). Inset: waveguide cross-section showing the TFLN rib waveguide (600 nm LN on SiO2). Scale bar: 500 µm. The layout scales to d = 32, N = 256 by increasing the splitter-tree depth and the number of rows. Table 1 summarizes the assumed device parameters for the thin-film lithium niobate (TFLN) photonic platform, based on recent demonstrations of high-Q TFLN micro-ring resonators [19, 74] and MRR weight bank architectures [55, 21]. The physical chip layout for an 8×8 demonstration configuration is shown in fig. 3.

Table 1: PRISM device parameters (TFLN platform).
- Platform: X-cut TFLN, 600 nm LN on SiO2
- Waveguide: rib, 1.4 × 0.6 µm, 500 nm etch
- MRR radius: 20 µm (FSR ≈ 8.3 nm)
- Loaded Q_L: ~10^4 (FDTD: 12,500) ‡
- Extinction ratio: >15 dB (add-drop)
- Weight precision: 5 bit (EO resolution)
- Tuning: EO (Pockels), 28.5 pm/V
- Static power: ~0 (capacitive EO)
- Switching energy: ~5 fJ/ring per update
- Tuning speed: <1 ns (EO response)
- Modulator: TFLN MZM, VπL ~ 2 V·cm
- Photodetector: balanced PD (differential)
- WDM channels: d = 32–128, 1.6 nm spacing †
- Laser: comb source, ≤100 mW

‡ The FDTD-simulated Q_L = 12,500 is limited by mesh discretization, not by material or design physics. Published TFLN micro-ring resonators achieve intrinsic Q_i ≥ 10^6–10^8 [19, 74]; at Q_i = 10^6, the loaded Q_L would rise to ~25,000, improving weight precision and extinction ratio.
Higher Q strictly improves Prism performance; the Q = 10^4 design point used here is therefore conservative. See Park & Park [41] for a detailed analysis of FDTD resolution limits and projected high-Q operating regimes. † d = 32–64 is realistic with current C+L band technology; d = 128 requires C+L+S band operation and has not been experimentally demonstrated.

The total MRR count (d × N; eq. 9) scales with configuration as shown in table 7. Because TFLN electro-optic tuning is capacitive, the static power consumption is near zero (Sec. VI.2).

IV Photonic Hardware Analysis

We now incorporate realistic photonic device impairments into the Prism simulation and quantify the optical link budget, noise performance, and energy–latency tradeoffs against electronic baselines.

IV.1 Device Impairment Modeling

We model six impairment sources that degrade the ideal inner-product computation of eq. 6 [7]: (i) weight quantization (4–8 bit DAC precision) [14], (ii) thermal drift of MRR resonance wavelengths (σ_drift = 0.01 nm to 0.1 nm) [39], (iii) MRR and waveguide insertion loss, (iv) photodetector shot and thermal noise (NEP ~ 10 pW/√Hz), (v) inter-channel MRR crosstalk (−15 dB to −30 dB isolation), and (vi) input DAC quantization noise. Table 2 summarizes the parameter ranges used in the hardware simulation. Full impairment models are provided in Supplementary Section S1.

IV.2 Optical Link Budget

A critical question for any photonic accelerator is whether sufficient optical signal-to-noise ratio (SNR) can be maintained across the complete optical path [51]. The complete optical link budget is provided in Supplementary table S1. For the nominal d = 32, N = 256 configuration with P_laser = 20 dBm, the per-detector received power is −15.9 dBm (25.7 µW).

Balanced detection link budget. The link budget in Supplementary table S1 traces the drop-port path to the target photodetector.
In the balanced configuration used by Prism, each MRR channel requires two optical paths—through-port and drop-port—each terminated by a separate photodetector and TIA. The through-port path sees lower loss (no drop-port penalty), so the drop-port budget above represents the worst case. Consequently, balanced detection doubles the photodetector and TIA count to 2N per wavelength channel; this overhead is reflected in table 3.

At P_PD = −15.9 dBm ≈ 25.7 µW per detector, the resulting photocurrent is I_ph = ℛ · P_PD = 1.0 A/W × 25.7 µW = 25.7 µA. The electrical SNR at the detector is

SNR = I_ph² / [2 e I_ph Δf + 4 k_B T Δf / R_L + (ℛ · NEP)² Δf], (11)

where Δf ≈ 1 GHz (matching the query update rate) and NEP = 10 pW/√Hz (eq. S5). For R_L = 1 kΩ and T = 300 K, we obtain SNR ≈ 37.2 dB—well above the minimum required for reliable rank ordering [53]. (Note: R_L = 1 kΩ assumes a transimpedance amplifier (TIA) front-end rather than 50 Ω termination.)

For larger bank sizes (N = 1024), the additional 6 dB splitting loss reduces the per-detector power to −21.9 dBm ≈ 6.5 µW, yielding SNR ≈ 25.5 dB. This remains adequate for top-k ranking, as verified by the recall analysis in Sec. IV.3. Beyond N = 4096 (SNR ≈ 13.5 dB), the link budget requires either a higher-power laser (P_laser > 26 dBm) or the banked splitter architecture described in Sec. I.3. Figure 4 illustrates the per-detector received power and SNR as a function of the bank size N, clearly showing the crossover point at which banked architectures or optical amplification become necessary.

Figure 4: Optical power budget analysis. (a) Per-detector received power vs. bank size N for three laser powers. The horizontal dashed line indicates the minimum detectable power (−20 dBm). (b) Electrical SNR at the photodetector vs.
signature dimension d for N = 256 and N = 1024. The shaded region marks SNR > 20 dB, sufficient for reliable top-k ranking.

Figure 5: MRR weight encoding principle. (a) Through-port and drop-port transmission of a single add-drop MRR (Q_L = 10,000, ER = 20 dB). The balanced weight w = T_through − T_drop maps from −1 (on-resonance) to +1 (fully detuned). (b) Weight-to-balanced-transmission mapping for different DAC precisions.

Table 2: Device impairment parameter ranges used in hardware simulation.
Impairment | Parameter | Range
Weight quantization | b (bits) | 4–8
Thermal drift | σ_drift (pm) | 10–100
MRR insertion loss | IL_MRR (dB) | 0.02–0.05 c
Splitter excess loss | α_excess (dB/stage) | 0.1–0.3
Detector NEP | (pW/√Hz) | 1–20
MRR crosstalk | Isolation (dB) | −15 to −30
DAC resolution | b_DAC (bits) | 4–8
c Through-port IL per non-target MRR; drop-port (target MRR) IL is ~0.1 dB.

IV.3 Recall Degradation Analysis

We inject impairments into the inner-product computation and measure recall@k degradation relative to the ideal (floating-point) baseline. Individual impairment sweeps (quantization precision, thermal drift, weight fidelity, and detector noise) are presented in Supplementary Figs. S1–S4.

Figure 6: Digital vs. MRR photonic inner-product comparison (d = 32, N = 256, K = 8). (a) Score correlation between exact (FP64) and MRR-computed similarity for three hardware configurations; Pearson correlation ρ > 0.98 for all configs. (b) Rank agreement for the nominal config (5-bit, 20 pm): green triangles indicate correctly identified top-K blocks (7/8 match, Recall@8 = 88%). (c) Normalised score error distributions; the pessimistic config (4-bit, 30 pm) shows wider tails but remains zero-centred.

Combined impairments. We simulate the full impairment chain (quantization + drift + loss + noise + crosstalk) using a Monte Carlo approach with 100 trials of 500 blocks (d = 32).
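A minimal Monte Carlo sketch of this combined-impairment experiment is shown below. It simplifies the full model of Supplementary Section S1 by lumping thermal drift into a weight perturbation σ_w; the function names and exact noise magnitudes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def recall_at_k(ideal, noisy, k):
    """Fraction of the ideal top-k recovered by the impaired scores."""
    return len(set(np.argsort(ideal)[-k:]) & set(np.argsort(noisy)[-k:])) / k

def impaired_trial(rng, N=500, d=32, k=8, bits=5, sigma_w=0.02, sigma_det=0.01):
    """One trial: b-bit weight quantization, drift lumped as a weight
    perturbation sigma_w, and additive detector noise sigma_det."""
    W = rng.uniform(-1, 1, (N, d))
    s = rng.uniform(-1, 1, d)
    ideal = W @ s
    levels = 2 ** bits - 1
    Wq = np.round((W + 1) / 2 * levels) / levels * 2 - 1   # DAC quantization
    Wq = Wq + rng.normal(0.0, sigma_w, W.shape)            # drift proxy
    noisy = Wq @ s + rng.normal(0.0, sigma_det, N)         # detector noise
    return recall_at_k(ideal, noisy, k)

rng = np.random.default_rng(1)
mean_recall = float(np.mean([impaired_trial(rng) for _ in range(100)]))
```

Because the score spread (≈√d) is much larger than the per-score perturbation, recall stays high even at modest bit depths, which is the qualitative behavior Figures 6–7 report.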
Figure 6 visualises the effect for a single trial: the MRR scores correlate strongly with the digital baseline (ρ > 0.98), and the top-k ranking is largely preserved. Figure 7 maps the recall degradation as a function of both weight precision and thermal drift magnitude, identifying the operating region in which Recall@8 exceeds 80%. The combined recall degradation at b = 6, σ_th = 0.01, and σ_det = 0.01 is approximately 8%, yielding an effective Recall@8 of 0.916 ± 0.087 (vs. 1.000 ideal). Each impairment source individually contributes modestly (5-bit quantization: 0.904, drift: 0.948, noise: 0.928), but their combination remains above the 90% threshold required for effective block selection.

Figure 7: Combined impairment sensitivity: Recall@8 as a function of weight precision (bits) and thermal drift σ (d = 32, N = 500, 50 Monte Carlo trials per cell). Markers indicate the three operating points studied in this work: nominal (5-bit, σ = 0.01), optimistic (6-bit, σ = 0.005), and pessimistic (4-bit, σ = 0.02). Recall exceeds 80% for ≥5-bit precision and σ ≤ 0.02.

The recall degradation results establish the acceptable operating region for the MRR weight bank. End-to-end NIAH validation with MRR-simulated block selection, confirming that these impairments do not degrade downstream task accuracy, is presented in Sec. V.3.

IV.4 Energy Model

Table 3 breaks down the energy per query evaluation for the Prism system. We define the energy metric as the total energy consumed by the photonic pipeline to evaluate all N blocks simultaneously for a single query on a single KV head (i.e., one complete selection operation).

Table 3: Prism energy breakdown per query—all N = 1024 blocks evaluated simultaneously (d = 64, k = 32, TFLN platform).
Component | Power (mW) | Energy/query (pJ)
Laser source | 100.0 | 900
TEC (thermal stab.)
| 1000 | 9000 †
Voltage driver array | 5.0 | 45
DACs (d channels) | 32.0 | 288
MZM modulators | 6.4 | 58
EO bias (static) | ~0 | ~0 ∗
Photodetectors (2N, balanced) | 10.0 | 90
TIAs + ADCs (2N, balanced) | 100.0 | 900
Top-k logic | 1.0 | 9
Dynamic subtotal | 254.4 | 2290
Total (incl. TEC) | 1254.4 | 11290
∗ TFLN EO tuning is capacitive; switching energy ~5 fJ/ring. † The 1 W TEC is a fixed overhead. At throughput T head-queries/s, TEC adds 1/T J per head-query (e.g., T = 10^5: ~10 µJ per head-query).

A key advantage of the TFLN platform is the elimination of static MRR tuning power. TFLN electro-optic tuning via the Pockels effect is capacitive and consumes near-zero static power (see Sec. VI.2 for the quantitative SOI comparison). The only energy cost per weight update is the switching energy of ~5 fJ per ring, which is negligible compared to the dynamic optical and electronic components [48]. Note that while the total system power (~1.25 W) is dominated by the TEC, this is a fixed overhead independent of query rate. At a throughput of T head-queries per second, the amortized TEC contribution is 1/T joules per head-query. For T = 10^5 (e.g., 100 concurrent users × 1000 tokens/s), TEC adds ~10 µJ per head-query—still well below the GPU baseline.

For comparison, the H100 GPU full-scan baseline reads every KV block signature once per query per head. The energy per selection is

E_scan = 2 d_h N b_prec E_byte = 2 × 128 × 1024 × 2 B × 31 pJ/B ≈ 16.3 µJ, (12)

where d_h = 128 is the head dimension, N = n/B = 1024 blocks at 128K context (B = 128), b_prec = 2 B (bf16), and E_byte ≈ 31 pJ/B (≈ 3.9 pJ/bit, standard HBM3 specification) [8]. Note that this baseline assumes the GPU scans the full key dimension d_h = 128; if the GPU instead scans compressed d = 32 signatures, the energy reduces to ~4.1 µJ (4× lower).
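Eq. 12 is simple enough to verify numerically. The sketch below (constants from the text; the function name is ours) reproduces the 16.3 µJ full-scan figure and the ~4.1 µJ compressed-signature variant:

```python
# HBM3 read energy from the text: ~31 pJ/byte (~3.9 pJ/bit); bf16 elements = 2 B.
E_BYTE_PJ = 31.0

def scan_energy_uj(d, n_blocks, bytes_per_elem=2):
    """Eq. 12: E_scan = 2 * d * N * b_prec * E_byte, returned in microjoules."""
    return 2 * d * n_blocks * bytes_per_elem * E_BYTE_PJ / 1e6

full = scan_energy_uj(d=128, n_blocks=1024)   # full-dimension keys -> ~16.3 uJ
fair = scan_energy_uj(d=32, n_blocks=1024)    # compressed signatures -> ~4.1 uJ
prism_uj = 2290 / 1e6                         # photonic selection energy (Table 3)
ratio = fair * 1e6 / 2290                     # ~1800x even in the fair comparison
```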
Even in this fairer comparison, Prism's ~2290 pJ selection energy remains ~1800× below the GPU scan [13]. GPU ANN (FAISS IVF-PQ) reduces the full-key scan to ~5 µJ by scanning O(√N) centroids. NVIDIA ICMS consumes ~10 µJ, estimated by replacing BW_HBM with the DPU's internal LPDDR5 bandwidth (~100 GB/s) and assuming a similar scan pattern over the flash-backed KV index.

IV.5 Latency Model

The latency breakdown (Supplementary table S2) shows a total photonic pipeline latency of ~9 ns, dominated by the CMOS top-k sorting logic (5 ns) and TIA+ADC conversion (2 ns). The total Prism latency of ~9 ns compares favorably with the electronic baselines: GPU full scan ~5 µs, GPU ANN ~1 µs, and NVIDIA ICMS ~0.5 µs—representing a ~500× speedup over full scan. However, this comparison must account for the additional latency of fetching the selected KV blocks from memory after Prism selection, which adds 0.5 µs to 2 µs depending on the memory tier (HBM vs. flash). The net latency benefit of Prism is therefore most pronounced when the selection ratio k/N is small and the KV cache resides in a slow memory tier (e.g., flash in ICMS). The crossover analysis quantifying these trade-offs across context lengths and baselines is presented in Sec. VI.5.

Figure 8 summarises the interplay between signature dimension, photodetector SNR, and ranking accuracy across the operating envelope of Prism.

Figure 8: SNR and recall analysis. (a) Electrical SNR at the photodetector as a function of signature dimension d for three bank sizes. (b) Recall@8 vs. SNR, showing that reliable top-k selection (>90% recall) requires SNR ≳ 15 dB.

Balanced photodetection noise. In the balanced configuration, each channel uses two photodetectors measuring through-port and drop-port signals independently [55].
Shot noise from both PDs adds in quadrature: σ_I² = 2e (I_through + I_drop) Δf. Since I_through + I_drop = ℛ P_0 (power conservation), the total shot noise is weight-independent, simplifying the noise analysis. The factor-of-2 increase in noise is offset by the doubled signal dynamic range ([−1, +1] vs. [0, 1]).

V System-Level Evaluation

This section evaluates the complete Prism pipeline from algorithmic profiling through end-to-end validation. We first profile retrieval heads across two LLM families (Sec. V.1), then evaluate block signature design and recall (Sec. V.2), and validate downstream accuracy via Needle-in-a-Haystack experiments with MRR-simulated block selection (Sec. V.3).

V.1 Retrieval-Head Analysis

Models and datasets. We profile two representative open-weight LLMs: Qwen2.5-7B-Instruct [43] (L = 28, H = 28, d_h = 128, GQA [2] with H_KV = 4; total 112 KV heads) and Qwen3-8B [45] (L = 36, H = 32, d_h = 128, GQA with H_KV = 8; total 288 KV heads). Qwen2.5-7B supports context lengths of at least 128,000 tokens; Qwen3-8B supports up to 32,000 tokens. We compute retrieval ratios R_h^(ℓ,h) (eq. 5) on a calibration set of 2–3 random token sequences per context length, with w = 256 as the local window size. The retrieval ratio is measured using a two-step procedure: SDPA-based prefill followed by eager last-token attention extraction. All experiments are run on an NVIDIA RTX 5880 (48 GB VRAM) for bf16 models, and an NVIDIA RTX 5070 (12 GB) for 4-bit quantized variants. We additionally verify consistency between bf16 and 4-bit quantized Qwen2.5-7B, finding that quantization does not substantially alter retrieval-head identification (e.g., 91.1% vs. 92.0% at 8K context for bf16 and 4-bit, respectively).

Results. Figure 9 shows the retrieval ratio heatmap across all layers and heads for both models.
Figure 9: Retrieval ratio R_h^(ℓ,h) for each KV head across all layers. Heads with R_h > 0.3 (dashed line) are classified as retrieval heads. (a) Qwen2.5-7B: 102/112 heads are retrieval heads (91.1%) at 8K context. (b) Qwen3-8B: 258/288 heads are retrieval heads (89.6%) at 8K context.

Detailed per-context retrieval ratios are provided in Supplementary table S3. Figure 10 visualizes the retrieval head fraction and mean retrieval ratio as a function of context length.

Figure 10: Retrieval head statistics vs. context length. (a) Retrieval head fraction R_h(τ = 0.3) for Qwen2.5-7B (bf16 and 4-bit) and Qwen3-8B (bf16). The fraction exceeds 90% for n ≥ 8K and approaches 99% at 256K context. (b) Mean retrieval ratio R̄_h for Qwen2.5-7B (bf16), showing that individual-head retrieval strength also increases with context length.

We observe the following patterns:

• Ubiquity of retrieval behavior. At a threshold of τ = 0.3, 91.1% of KV heads in Qwen2.5-7B and 89.6% in Qwen3-8B are retrieval heads at 8K context. This prevalence increases with context length: for Qwen2.5-7B, the fraction rises from 83.9% at 2K to 98.2% at 128K context, indicating that nearly all heads engage in long-range retrieval at long contexts. This monotonic increase creates a compounding advantage for photonic acceleration: not only does the per-head benefit of O(1) photonic computation grow with context, but the fraction of heads requiring acceleration also increases. Note that at the more permissive τ = 0.1 threshold used in [66], essentially 100% of heads qualify as retrieval heads. The reported fraction is thus sensitive to the threshold choice: varying τ from 0.1 to 0.3 shifts the classified fraction from ~100% to ~90%. The contrast with DuoAttention's 25–50% retrieval fraction reflects both (i) different models (Llama-2/Mistral vs.
Qwen) and (ii) DuoAttention's use of a learned gating function optimized on calibration data, which imposes a stricter criterion than a simple threshold on attention mass. In practice, the threshold can be tuned per deployment scenario to trade off between the number of heads served photonically and the complexity of the photonic accelerator.

• Layer distribution. The highest-scoring retrieval heads are concentrated in layers 14–26, with peak retrieval ratios exceeding 0.93.

• GQA effect. Because GQA shares KV heads across multiple query heads, the number of KV cache entries requiring retrieval-style treatment is even smaller than the head count suggests.

The key implication for Prism is that the photonic accelerator needs to serve the vast majority of KV heads—102 out of 112 for Qwen2.5-7B and 258 out of 288 for Qwen3-8B at 8K context. However, GQA sharing means each KV head serves multiple query heads, so the number of independent weight-bank instances required equals the KV head count, not the query head count.

V.2 Block Signature Design

We partition the KV cache into contiguous blocks of B tokens and compute a d-dimensional signature for each block [22]. We evaluate mean-key and random-projection signature methods from Sec. I.2 at block size B = 128 and signature dimensions d ∈ {16, 32, 64, 128}, using Qwen2.5-7B at context length n = 4096. Our experiments identify B = 128 with d = 32 and mean-key projection as the best configuration. At the primary operating point k = 32, table 4 shows R@32 = 100% at 8K context (B = 128, 64 blocks), confirming that the signature ranking correctly identifies all relevant blocks. At 16K, R@32 drops to 57.5%, yet downstream NIAH accuracy remains 100% (table 6), indicating that task-critical blocks are consistently ranked in the top-k even when overall recall is imperfect.
This significant gap between block-level recall and task accuracy demonstrates an inherent error tolerance: task-relevant information is distributed across neighboring blocks, providing natural redundancy that absorbs photonic impairment noise. As a stress-test analysis at k = 8, R@8 = 77.3% (R@2 = 31.3%, R@4 = 50.0%), confirming that useful ranking signal persists even under aggressively small selection budgets. Mean-key projection consistently outperforms random projection across all tested dimensions, confirming that the natural key-space geometry contains exploitable structure for block ranking.

Why mean-key and random projections? We focus on mean-key and random-projection signatures because they are model-agnostic and require no training, matching our goal of a general-purpose photonic hardware interface. Learned projections (e.g., trained linear maps optimized for recall) could improve signature quality but would require per-model fine-tuning and hardware-aware training, which we leave to future work (Sec. VII.3).

Figure 11: Recall@k as a function of signature dimension d for different signature methods. Block size B = 128, k = 8 (stress-test setting). Mean-key projection consistently outperforms random projection, achieving 77.3% recall@8 at d = 32. At the primary operating point k = 32, recall reaches 100% at 8K context (table 4).

Signed weight encoding. The add-drop MRR configuration with balanced photodetection enables direct encoding of signed weights w ∈ [−1, +1], eliminating the non-negative constraint of through-port-only architectures. Compared to ReLU projection (which discards sign information, losing ~50% of the signature variance for zero-mean Gaussian signatures), balanced photodetection preserves the full signed inner product. Our simulations show that signed encoding improves Recall@8 by ~87% relative to ReLU projection at d = 32 (Supplementary fig. S6).

Recall metric.
We define recall@k [31] as the fraction of the true top-k blocks (by exact query–key inner product) that appear in the Prism-selected top-k blocks:

Recall@k = |S_PRISM ∩ S_exact| / k, (13)

where S_PRISM and S_exact are the sets of top-k block indices selected by Prism and by exact computation, respectively.

Table 4: Recall@k for Prism block selection across context lengths. Qwen2.5-7B, B = 128, d = 32, mean-key projection. Values averaged over 15 (layer, head) pairs.
n | Blocks | R@8 (%) | R@16 (%) | R@32 (%) | NIAH (%)
4K | 16 | 46.7 | 100 | —∗ | 100
8K | 32 | 29.2 | 55.8 | 100 | 100
16K | 64 | 26.7 | 41.7 | 57.5 | 100
32K | 128 | (OOM†) | | | 100
64K | 256 | — | | | 100
∗ Only 16 blocks at 4K; k = 32 exceeds the total. † Eager attention OOM at 32K; NIAH uses SDPA (no attention matrix).

Traffic reduction. At the primary operating point k = 32, the traffic ratio is kB/n = 32 × 128/n. At 128K tokens (N = 1024 blocks), Prism selects k = 32 of N = 1024 blocks, yielding an N/k = 1024/32 = 32× traffic reduction (3.1% traffic). At 1M tokens (N ≈ 7812 blocks), the reduction grows to N/k ≈ 7812/32 ≈ 244× (0.41% traffic), though model accuracy at such lengths remains model-dependent. Under the stress-test setting k = 8, the reduction reaches 128× at 128K and projects to ~977× at 1M tokens (fig. 12).

Figure 12: Memory traffic reduction factor as a function of context length for fixed block size B = 128 and k = 8 selected blocks. The traffic reduction scales linearly with context length, reaching 977× at n = 1,000,000.

V.3 NIAH Accuracy Under Hardware Impairments

To validate that the MRR-impaired block selection preserves end-to-end language model performance, we integrate the MRR array simulator into Qwen2.5-7B [43] and evaluate on the Needle-in-a-Haystack (NIAH) benchmark [64, 25]. For each decode step, block signatures (mean-key, d = 32) are processed through the MRR simulator to select the top-k blocks.
Retrieval heads (R_h > 0.3; Supplementary table S3) use MRR-selected blocks plus a 256-token recent window; streaming heads retain full attention. We test four MRR configurations: (i) ideal (floating-point inner product), (ii) 5-bit/20 pm drift (nominal), (iii) 4-bit/30 pm drift (pessimistic), and (iv) 5-bit/10 pm drift (optimistic).

Table 5: NIAH accuracy (%) with MRR-integrated block selection (Qwen2.5-7B, 11 positions, k = 8 stress-test setting).
Configuration | 2K | 4K | 8K
Full attention | 90.9 | 100.0 | 100.0
Ideal select | 90.9 | 100.0 | 100.0
5-bit, 20 pm | 90.9 | 100.0 | 100.0
4-bit, 30 pm | 90.9 | 100.0 | 100.0
5-bit, 10 pm | 90.9 | 100.0 | 100.0

Table 5 shows that all four MRR configurations—including the worst-case 4-bit quantization with 30 pm thermal drift—achieve identical NIAH accuracy to full attention at all tested context lengths. The single miss at 2K context (position 50%) is a model-level artifact unrelated to block selection. These results demonstrate that the MRR impairments modelled in Sec. IV.1 do not degrade downstream task accuracy for the block-selection ranking task.

To validate Prism across a wide range of context lengths, we extend the evaluation using SDPA-based attention (Flash Attention) with KV cache offloading to CPU RAM via OffloadedCache. This enables experiments at context lengths from 4K to 128K on a single GPU (RTX 5880, 48 GB) with 128 GB system RAM. We note that Qwen2.5-7B's native context window is 128K tokens; at 128K, the base model's own accuracy degrades to 45.5% on NIAH (table 6), limiting meaningful evaluation beyond 64K. Extrapolation to longer contexts (e.g., 1M tokens via YaRN [44] RoPE scaling) is technically feasible for the photonic hardware, but model-level accuracy at such lengths remains an open challenge independent of block selection.
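The block-selection arithmetic used throughout this evaluation—recall@k (eq. 13) and the kB/n traffic ratio from Sec. V.2—can be stated compactly in code (an illustrative sketch; the function names are ours):

```python
def recall_at_k(selected, exact):
    """Eq. 13: |S_PRISM ∩ S_exact| / k, with k = len(exact)."""
    return len(set(selected) & set(exact)) / len(exact)

def traffic_reduction(n_tokens, B=128, k=32):
    """N/k reduction from fetching only k of N = n/B cached blocks."""
    return (n_tokens // B) / k

r_128k = traffic_reduction(128 * 1024)  # 1024 blocks -> 32x (3.1% of traffic)
r_1m = traffic_reduction(1_000_000)     # ~7812 blocks -> ~244x
```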
For sparse evaluation, we employ physical token selection: rather than re-attending to all tokens with a mask, only the tokens from the top-k selected blocks and a recent window are assembled into a compact input (~5K tokens), preserving positional encoding via explicit position_ids. This approach mirrors the actual deployment scenario, where only selected KV blocks are fetched from memory. Table 6 presents the extended NIAH results across context lengths from 4K to 128K. The full 2D NIAH heatmap (context length × needle depth, 10 positions per context) is shown in Supplementary fig. S5.

At k = 32 blocks (B = 128), all MRR configurations achieve 100% accuracy from 4K through 64K, perfectly matching full attention. At 128K, the base model itself degrades to 45.5%—a known limitation of Qwen2.5-7B's context window—making sparse-vs-full comparison uninformative at this length. Within the model's reliable operating range (n ≤ 64K), MRR block selection introduces zero accuracy penalty while reducing KV memory traffic by 16× at 64K (kB/n = 32 × 128/65536 = 6.25%); the reduction grows to 32× (3.1%) at 128K.

Table 6: Extended NIAH accuracy (%) with MRR block selection (Qwen2.5-7B, B = 128, d = 32, k = 32). At 128K the base model degrades (†). ∗ 50 needle positions; others use 10 positions.
Configuration | 4K∗ | 8K∗ | 16K | 32K | 64K | 128K†
Full attention | 100 | 100 | 100 | 100 | 100 | 45.5
Ideal (k = 32) | 100 | 100 | 100 | 100 | 100 | 18.2‡
5-bit, 20 pm | 100 | 100 | 100 | 100 | 100 | 27.3‡
4-bit, 30 pm | 100 | 100 | 100 | 100 | 100 | 27.3‡
5-bit, 10 pm | 100 | 100 | 100 | 100 | 100 | 27.3‡
Needle block recall (MRR top-k selection): 5-bit, 20 pm | 100∗ | 88∗ | — | — | — | —
∗ Extended to 50 needle positions (vs. 10 for other columns), confirming robustness of the 10-position findings. Needle block recall at 8K (88%) indicates that 6 of 50 needle blocks fall outside the top-k = 32 selection, yet NIAH answer accuracy remains 100% because surrounding context blocks contain sufficient retrieval cues.
‡ At 128K, full attention itself degrades to 45.5%; the apparent superiority of impaired configurations over the ideal one is within sampling noise and is not statistically significant.

VI Photonic Scaling Analysis

We now analyze how the photonic engine scales to larger systems, identifying constraints from WDM channel density, thermal power, chip area, and time-multiplexed operation.

VI.1 MRR Integration Scaling

The total MRR count in the Prism weight bank is

N_MRR = d × N, (14)

where d is the number of WDM wavelength channels (signature dimension) and N is the number of parallel signature banks (one per KV cache block). For a context length of n tokens with block size B, N = n/B. Table 7 lists representative configurations spanning three orders of magnitude in MRR count.

Table 7: MRR count for representative Prism configurations. The rightmost column indicates the approximate context length supported at block size B = 128.
d | N | N_MRR | Context | Feasibility
32 | 256 | 8192 | 32K | Current TFLN
64 | 1024 | 65,536 | 128K | Near-term
128 | 4096 | 524,288 | 512K | Multi-chip

Current photonic integration supports 10^4–10^5 active devices per die [21, 50], placing the d = 32, N = 256 configuration within demonstrated capability and d = 64, N = 1024 at the near-term frontier. The d = 128, N = 4096 configuration exceeds single-chip density, requiring chiplet-based multi-chip modules [47] (Sec. VI.3).

VI.2 Thermal Power Budget and WDM Channel Limits

On thermo-optic SOI platforms [39], each MRR requires ~2.5 mW of static heater power, yielding aggregate budgets of 20 W (d = 32, N = 256) to 164 W (d = 64, N = 1024)—approaching the ~200 W practical limit with active cooling.
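The scaling relations of eq. 14 and the SOI heater budget quoted above reduce to a few lines (a sketch; the 2.5 mW/ring figure is the per-MRR heater power from the text, and the function names are ours):

```python
def mrr_count(d, n_blocks):
    """Eq. 14: one MRR per (wavelength, signature-row) pair."""
    return d * n_blocks

def context_tokens(n_blocks, B=128):
    """Context length supported by N parallel signature rows at block size B."""
    return n_blocks * B

def soi_heater_power_w(d, n_blocks, p_per_ring_mw=2.5):
    """Aggregate static heater power on a thermo-optic SOI platform."""
    return d * n_blocks * p_per_ring_mw / 1e3

near_term = mrr_count(64, 1024)        # 65,536 MRRs -> 128K-token context
soi_small = soi_heater_power_w(32, 256)    # ~20 W
soi_large = soi_heater_power_w(64, 1024)   # ~164 W
```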
On the TFLN platform, MRR tuning via the Pockels effect (r_33 = 30.9 pm/V) is capacitive, with near-zero static power (<1 µW per ring from CMOS driver leakage):

P_static^TFLN = d × N × P_leakage < d × N × 1 µW. (15)

For the d = 64, N = 1024 configuration, P_static < 0.07 W—a ~2400× reduction over SOI. The switching energy (~5 fJ per ring) yields <0.3 µW total switching power at typical decode rates—negligible. Residual thermal stabilization via TEC (~1 W for a ~1 cm² chip) remains necessary but is orders of magnitude below SOI heater budgets. TFLN's lower thermo-optic coefficient (dn/dT ≈ 4×10⁻⁵ K⁻¹ vs. 1.8×10⁻⁴ K⁻¹ for Si) further reduces thermal crosstalk.

WDM channel limits. The signature dimension d is constrained by the MRR free spectral range (FSR) [11]. A single-FSR MRR (R = 20 µm, FSR ≈ 8.3 nm) supports only ~5 channels at 200 GHz spacing. Vernier-coupled dual-ring filters extend the effective FSR to ~50 nm (d ~ 30); C+L band operation (95 nm) enables d ~ 60. Achieving d = 128 requires FSR extension with C+L+S band operation (Supplementary Section S3).

VI.3 Chip Area Estimation

The 32×256 configuration (8192 MRRs) fits on a single ~5×5 mm² die; the 64×1024 configuration requires multi-chip packaging or folded layouts. Detailed area estimates are provided in Supplementary Section S4. Figure 13 summarizes the scaling trend.

Figure 13: Prism photonic scaling projections. MRR count, aggregate heater power (SOI), and estimated chip area as functions of the configuration parameters d and N. The dashed horizontal lines indicate practical limits: 200 W thermal dissipation (active cooling) and 858 mm² single-reticle area. Configurations below both limits (shaded region) are realizable on a single photonic chip.
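As a rough check of the WDM channel counts quoted in the thermal-budget discussion (assuming the 1.6 nm ≈ 200 GHz grid of Table 1; the function name is ours):

```python
def wdm_channels(fsr_nm, spacing_nm=1.6):
    """Channels fitting in one free spectral range at a fixed grid spacing.
    1.6 nm corresponds to ~200 GHz near 1550 nm."""
    return int(fsr_nm // spacing_nm)

single_fsr = wdm_channels(8.3)   # plain 20 um MRR -> ~5 channels
vernier = wdm_channels(50.0)     # Vernier dual-ring, effective FSR ~50 nm -> ~31
c_plus_l = wdm_channels(95.0)    # full C+L band -> ~59, i.e. d ~ 60
```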
VI.4 Time-Multiplexed Operation

Area and power constraints can be relaxed by trading physical parallelism for temporal reuse via time-multiplexed weight programming [5]. The system deploys N_phys physical rows and cycles through M weight configurations:

N_logical = M × N_phys,  M = ⌈N/N_phys⌉. (16)

On TFLN, EO reprogramming is sub-nanosecond (t_reprogram ≪ t_optical), so the total latency simplifies to t_total ≈ M × t_optical. Even at M = 8, the total latency (80 ns) remains four orders of magnitude below the GPU full-scan baseline (~200 µs)—a fundamental advantage over thermo-optic SOI (t_reprogram ~ 10 µs). For LLM decode at 128K+ context, M = 4–8 is a practical sweet spot: it reduces the physical MRR count by 4–8× (to 8192–16,384), keeps chip area within a single reticle, and resolves the area scaling barrier of Sec. VI.3, making d = 64 realizable with current TFLN technology (fig. 14).

Figure 14: Time-multiplexed Prism operation. (a) Physical MRR count vs. multiplexing factor M (d = 64, N = 1024). (b) Query evaluation latency vs. M, compared to electronic baselines. The shaded region indicates the sweet spot (M = 4–8).

Batch serving with page-by-page evaluation. When the number of cached blocks N exceeds the physical chip capacity N_chip, the system pages through M = ⌈N/N_chip⌉ configurations per evaluation. For batch serving with B_batch concurrent users, the total selection time per decode step is

t_select = B_batch × H_KV × L × M × (t_reprogram + t_optical), (17)

where H_KV is the number of KV heads and L the number of layers (time-multiplexed across heads and layers on a single chip).
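A numeric sketch of eq. 17 under the Qwen2.5-7B serving parameters used in the text (t_reprogram = 4 ns, t_optical = 9 ns; the function name and unit choices are ours):

```python
def t_select_ms(batch, h_kv, layers, m, t_reprogram_ns=4.0, t_optical_ns=9.0):
    """Eq. 17: selection time per decode step, time-multiplexed over
    batch * H_KV * L head-queries with M weight pages per query."""
    return batch * h_kv * layers * m * (t_reprogram_ns + t_optical_ns) * 1e-6

# Qwen2.5-7B: H_KV = 4, L = 28; batch = 128 concurrent users.
t_1m = t_select_ms(128, 4, 28, m=8)    # ~1.5 ms at 1M-token context
t_10m = t_select_ms(128, 4, 28, m=80)  # ~14.9 ms at 10M-token context
```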
For Qwen2.5-7B (H_KV = 4, L = 28) with N_chip = 1024, t_reprogram = 4 ns, and t_optical = 9 ns: at 1M context (M = 8), batch = 128 yields t_select = 1.5 ms; at 10M (M = 80), t_select = 14.9 ms. Adding model weight reads (4.2 ms) and KV block fetch (8.6 ms), the total per-token decode latency remains 3.5×–5.3× lower than GPU block selection at 10M–100M context (fig. 16). Unlike GPU full scan, where latency is fundamentally bandwidth-limited at O(N), Prism paging latency grows as O(N/N_chip) with a much smaller constant (13 ns per page vs. ~8 ns per block for HBM reads), and can be eliminated entirely by scaling N_chip with additional parallel banks.

VI.5 Energy and Latency Crossover

We define the crossover point n∗ as the context length at which Prism-assisted decoding cost equals the electronic baseline. The Prism cost (photonic selection energy ~2290 pJ per query plus reduced GPU fetch) is compared against the GPU full-scan cost (fetching all N blocks via HBM). On TFLN, near-zero static power means the selection cost is dominated by dynamic components. The full derivation is in Supplementary Section S2.

Figure 15: Energy crossover map for Prism vs. electronic baselines. The crossover contour (C_PRISM/C_baseline = 1) shifts to shorter context lengths as d decreases. (a) vs. GPU full scan: practical benefit at n∗ ≈ 4K. (b) vs. GPU ANN: n∗ ≈ 2K. (c) vs. NVIDIA ICMS: n∗ ≈ 4K.

Energy crossover. Against the GPU full scan (fig. 15a), the mathematical crossover occurs at n∗ < 1K tokens (d = 64, N_bank = 4); practical benefit emerges at n ≥ 4K, where traffic reduction exceeds 8×. The per-query dynamic energy (~2290 pJ; table 3) is four orders of magnitude below the H100 fetch energy at 128K context (~16.3 µJ).
This GPU baseline assumes a full-dimension scan (d_h = 128). A fairer comparison lets the GPU scan compressed d = 32 signatures, reducing the scan energy to E_scan^fair ≈ 12 µJ. Even then, Prism maintains a ∼5000× advantage (∼2290 pJ vs. ∼12 µJ), preserving a comfortable crossover margin. On thermo-optic SOI, the ∼164 W heater power would place the crossover at n* ≈ 4K. Against GPU ANN (FAISS IVF-PQ) [23], which reduces the O(N) scan, the crossover is at n* ≈ 2K. Against NVIDIA ICMS (a DPU with lower bandwidth than GPU HBM), n* ≈ 4K based on estimated GTC 2024 specifications.

Latency crossover. The ∼9 ns photonic evaluation is orders of magnitude below the ∼5 µs GPU scan, so the selection step is effectively free in latency terms (n* ≲ 4K tokens).

Sensitivity. The dominant factors are: (i) dynamic power (∼254 mW, dominated by TIAs/ADCs, laser, and DACs; table 3); (ii) signature dimension d (controls MRR count and area, traded against recall quality, Sec. V.2); (iii) bank count N_bank (splitting loss vs. parallelism); and (iv) HBM bandwidth (HBM4 improvements shift the crossover to longer contexts).

Scaling projections. The energy ratio C_PRISM / C_GPU decreases as ∼1/n because the electronic scan cost grows linearly while the Prism selection cost is fixed. At n = 1,000,000 (N ≈ 8000 blocks), the GPU reads ∼1 MB of signatures per head per query; Prism accommodates this with N_bank = 8 banks (512,000 MRRs total). In multi-agent scenarios, a single weight bank serves A agents simultaneously (only the query sketch changes), amortizing dynamic power by 1/A.

Figure 16: Per-token decode latency vs. context length for batch serving (batch = 128, Qwen2.5-7B, B = 128, k = 32).
GPU full-scan latency (dashed gray) is shown for four hardware generations: H100 (3.35 TB/s) [33], B200 (8 TB/s) [34], Vera Rubin (22 TB/s) [36], and Rubin Ultra (32 TB/s HBM4e) [37]. GPU block selection (Quest et al., dark gray) performs an O(N) signature scan from HBM. Prism (red, N_chip = 1024) uses page-by-page time-multiplexing (eq. 17) when N > N_chip, growing as O(N/N_chip) but with a per-page cost of only 13 ns. Latency is computed as t = t_weights + t_select + t_fetch, where t_weights = 14 GB / BW_HBM and t_fetch = batch × k × B × 2 d_h b_prec × H_KV × L / BW_HBM.

VII Discussion

VII.1 Limitations and Practical Considerations

All hardware results in this work are based on device-level simulations with parameters extracted from FDTD and supplemented by literature values; no physical prototype has been fabricated or measured. The impairment models, while grounded in FDTD simulation and published device data, may not capture all fabrication-dependent effects such as waveguide roughness variations, EO electrode non-uniformity across a large array, and packaging-induced stress. At d = 64 and N = 1024, the system requires 65,536 MRRs; systematic characterization of >10,000-MRR arrays on TFLN has not been reported, though recent progress in TFLN foundry processes suggests that large-scale integration is feasible [19]. Fabrication non-uniformity causes resonance wavelength variations of σ_λ ∼ 0.5–2 nm across a wafer [54], but on the TFLN platform EO tuning can compensate via DC bias adjustment without static power penalty. Residual thermal drift, while mitigated by lithium niobate's ∼4× lower thermo-optic coefficient compared to silicon, still requires chip-level thermal stabilization (∼1 W TEC budget).
The 1 W TEC is a fixed overhead independent of query rate; at a throughput of T head-queries per second, the amortized TEC contribution is 1/T joules per head-query. For T = 10^5 (e.g., 100 concurrent users × 1000 tokens/s), TEC adds ∼10 µJ per head-query, bringing the total to ∼12.3 µJ. The 2290 pJ value cited elsewhere in this paper refers to the dynamic photonic pipeline alone.

Interface latency. The ∼9 ns latency reported for Prism reflects the photonic pipeline alone (DAC through top-k selection) and does not include host interface overhead. A PCIe 5.0 round-trip (DMA setup and transfer) adds ∼1–2 µs; CXL-attached memory semantics reduce this to ∼200–500 ns; direct interposer or co-packaged integration would add only ∼10–50 ns. Even with PCIe overhead, the total system latency of ∼2 µs remains below the GPU full-scan latency (∼5 µs), yielding a system-level 2–3× speedup. Co-packaging, the long-term integration target, would preserve the ∼100× raw photonic advantage. Speedup claims should therefore be interpreted as system-level 2–3× with PCIe, potentially 100× with co-packaging.

Demonstrated vs. projected scale. To clarify the maturity of the MRR integration scales assumed in this work: demonstrated TFLN arrays have reached ∼10–100 MRRs [19], while SOI platforms have demonstrated ∼1000–10,000 MRRs [21]. Prism's "current" configuration (8192 MRRs at d = 32, N = 256) is a projected design point that extrapolates from these demonstrations; the flagship configuration (65,536 MRRs at d = 64, N = 1024) is also projected and would likely require multi-chip or wafer-scale integration.
At d = 64 and N = 1024, the system requires 65,536 individually addressable voltage bias lines for fabrication-offset compensation of each MRR, presenting a significant packaging and routing challenge that will require advanced fan-out or interposer-based solutions.

The add-drop MRR configuration with balanced photodetection resolves the sign limitation of through-port-only architectures. The balanced differential photocurrent I_through − I_drop naturally encodes signed weights in [−1, +1], enabling true signed inner products without ReLU projection or split encoding. The trade-off is a doubling of the photodetector count (two PDs per channel), but since PDs are orders of magnitude smaller than MRRs, the area penalty is negligible.

The retrieval-head classification threshold τ = 0.3 used throughout this work becomes less discriminative at longer contexts, where most heads tend to exhibit high retrieval scores; DuoAttention's learned gating identifies only 25–50% of heads as retrieval heads. The 90%+ fraction reported here should therefore be interpreted as an upper-bound estimate at the evaluated context lengths.

For multi-head serving, GQA [2] reduces the number of independent weight bank instances from the retrieval-head count (102 for Qwen2.5-7B) to the KV-head count (H_KV = 4), since block signatures are derived from key vectors at KV-head granularity. This synergy deepens as the field trends toward more aggressive key–value sharing: under multi-query attention (H_KV = 1), a single weight bank per layer would suffice. These 4 heads can be served by time-multiplexed reprogramming (∼4 ns on TFLN, negligible vs. the ∼5 µs KV fetch) or by parallel replication of 4 weight banks. When the layer dimension is included, the full configuration space is H_KV × L = 4 × 28 = 112 weight bank instances per decode step.
Under time-multiplexing, this amounts to 112 × ∼1 ns ≈ 112 ns of total reprogramming overhead, still ∼45× smaller than a single KV block fetch (∼5 µs) and therefore negligible in the decode-step budget. Alternatively, a layer-parallel deployment with 28 Prism chips (one per layer, each serving 4 KV heads) would eliminate the layer serialization entirely at the cost of additional chip area.

VII.2 Comparison with Related Approaches

The block-level top-k selection mechanism at the core of Prism builds on a strategy independently validated by several works: Quest [58] preserves over 99% of full-attention accuracy on long-context benchmarks including NIAH up to 1M tokens, DuoAttention [66] maintains LongBench performance within 1–2% of full attention, and InfLLM [65] and RocketKV [6] provide additional evidence for block-level selection at long context. Prism's contribution is orthogonal: the key question is not whether block selection preserves quality (answered affirmatively above) but whether MRR-based analog computation introduces sufficient error to degrade the selection. Our NIAH results (Sec. V.3) confirm that it does not, even under pessimistic hardware impairments. Tian et al.'s photonic transformer chip (PTC) [59] demonstrates that coherent optical interference can implement full transformer attention with high throughput (>200 POPS); however, it targets dense attention computation rather than the coarse block-selection task addressed by Prism, and its O(n) memory-access scaling remains for long-context KV caches. InfLLM is the most directly comparable system, as it offloads the full KV cache to CPU RAM and retrieves blocks via electronic inner products. The key distinction is selection-latency scaling: InfLLM's selection time grows as O(N) with the number of cached blocks, while Prism's photonic engine evaluates all N similarities in O(1) optical transit time.
This advantage grows with context length, precisely the regime where the KV cache bottleneck is most severe. Relative to Quest [58] and RocketKV [6], which perform block selection digitally on the GPU, Prism targets a different bottleneck: these methods reduce compute by pruning low-scoring KV blocks but still require the GPU to read all block signatures from HBM (costing O(N) memory traffic per decode step). In short, RocketKV reduces the number of fetched KV blocks but still requires scanning all candidates; Prism eliminates the scan itself. The key enabler is that Prism stores block signatures directly in the MRR weight bank (programmed incrementally as each block completes, every B tokens), so no HBM read is needed at decode time, whereas Quest and RocketKV must read all N signatures from HBM at every decode step. Quest- or RocketKV-style scoring policies could still define which blocks are selected, while Prism accelerates the execution of that selection. Table 8 quantifies how the scan fraction of total HBM traffic grows with context length, showing that the signature scan, which Prism eliminates, becomes the dominant cost at long contexts.

Table 8: Signature scan as a fraction of total decode-step HBM traffic (electronic block selection, d = 32, k = 32, B = 128, d_h = 128). At million-token contexts the scan dominates; Prism removes this cost entirely.

Context | N (blocks) | Scan / step | Scan fraction
16K | 128 | 8 KB | 0.4%
128K | 1,024 | 64 KB | 3%
1M | 8,192 | 512 KB | 20%
10M | 81,920 | 5 MB | 71%
100M | 819,200 | 50 MB | 95%

Electronic ASIC alternative. An electronic ASIC could, in principle, also implement parallel selection through massive parallelism. However, such an approach requires N-way replicated compute units, local storage for all N signatures, and high-fan-out interconnect, leading to scaling challenges in area, power, and routing complexity as N grows.
In contrast, Prism leverages the natural broadcast property of photonic systems, where a single query can be passively distributed to all candidates without explicit data movement, making scan-free selection physically native to the hardware rather than architecturally forced. The key contribution of this work is not limited to a specific photonic implementation: it is the identification of selection as a memory-read bottleneck and the demonstration that removing the O(N) scan fundamentally changes system scaling. Prism should be viewed as a hardware instantiation of scan-free selection, particularly natural in photonic broadcast systems. Table 9 summarizes the fundamental distinction between these approaches.

Table 9: Comparison of KV cache block-selection strategies (N = 1024 blocks, d = 32, k = 32, H100 GPU baseline). GPU values estimated from the H100 datasheet (3.35 TB/s HBM3, 0.71 pJ/MAC); Prism values from device-physics simulation.

 | GPU–HBM full scan | GPU block selection | Prism selective fetch
Scan eliminated? | No | No | Yes
Selection latency | ∼5 µs | ∼1–5 µs | ∼9 ns
Selection energy | ∼16.3 µJ | ∼4–16 µJ | ∼2.3 nJ
HBM traffic | 100% | ∼3% + scan | ∼3% only
Scales with N | O(N) | O(N) | O(1)

The GPU ANN baseline used in our crossover analysis (FAISS IVF-PQ [23]) represents a well-established but not state-of-the-art GPU search library; more recent GPU-accelerated ANN libraries (e.g., CAGRA, cuVS) may further reduce the electronic baseline latency and energy, narrowing the crossover window. NVIDIA's ICMS [35] addresses the complementary capacity problem (terabyte-scale flash-backed KV storage with DPU-managed prefetch), while Prism solves the selection problem via photonic parallel inner products. Note that the ICMS energy and bandwidth specifications used in our comparisons are estimated from public announcements; no published measurements are available, and actual performance may differ.
A natural integration would place Prism within or adjacent to the ICMS, combining storage capacity with photonic selection speed. The recently announced NVIDIA Rubin platform [4] further underscores industry momentum toward dedicated KV cache acceleration hardware, complementary to Prism's photonic approach.

Complementary compression. KVTC [52] applies transform coding (PCA decorrelation, adaptive quantization, and entropy coding) to compress KV caches by ∼20× with negligible accuracy loss (ICLR 2026). KVTC and Prism are orthogonal: KVTC reduces the size of each stored block (a capacity problem), while Prism eliminates the O(N) signature scan for block selection (an access-pattern problem). Combining both would yield simultaneous 20× storage reduction and O(1) selection. Note that KVTC's PCA-transformed representations may require adapting the block-signature computation to the decorrelated basis, which we leave for future work.

VII.3 Outlook

The immediate next step is fabrication of a small-scale TFLN MRR prototype (an 8×8 weight bank) to validate inner-product accuracy under real device impairments and provide measured values for parameters currently extracted from simulation. Scaling to a full module (d = 64, N = 256) integrated with GPU-based LLM inference would validate the crossover predictions of Sec. VI.5. Integrating non-volatile weight storage (e.g., phase-change trimming [60, 1]) could further reduce switching energy for quasi-static block signatures [12, 69]. More challenging benchmarks such as SCBench [27] and query-focused retrieval analysis [70] would strengthen confidence in the robustness of photonically selected blocks beyond the NIAH validation presented here.

Practical integration.
A deployable Prism module would package the photonic chip, laser source, and TEC onto a single substrate, offered in one of three form factors: a PCIe add-in card for drop-in datacenter use, a CXL-attached device for lower-latency memory-semantic access, or a co-packaged chiplet on an interposer for maximum performance. Integration with existing LLM serving stacks (e.g., vLLM, TensorRT-LLM) would proceed via a block-index API: the host submits a query sketch and receives ranked block indices, transparently replacing the software signature-scan kernel.

Benchmark scope. NIAH is a retrieval-oriented benchmark that tests single-needle recall; it does not exercise multi-hop reasoning, summarization, or other long-context capabilities. To assess generalization beyond NIAH, we evaluated Prism block selection on three LongBench-v2 domains (table 10). Across all domains, MRR-impaired block selection (5-bit, 20 pm drift) produces accuracy identical to full attention, confirming that the block-selection approximation introduces no measurable quality loss on diverse long-context tasks.

Table 10: LongBench-v2 accuracy (%) with Prism block selection (Qwen2.5-7B, k = 32, B = 128, d = 32, 4K context, 5-bit quantization, 20 pm drift).

Domain | Full attn. | Prism | Drop
Multi-Document QA | 16.7 | 16.7 | 0.0
Single-Document QA | 23.3 | 23.3 | 0.0
Long In-context Learning | 26.7 | 26.7 | 0.0
Overall | 22.2 | 22.2 | 0.0

We note that the block-selection mechanism is inherited from Quest [58] and InfLLM [65], which have been validated on broader benchmarks (LongBench, ∞Bench); Prism's contribution is the photonic hardware mapping of this selection, not the selection algorithm itself.

VIII Conclusion

We have presented Prism, a TFLN photonic similarity engine that computes all N block-selection inner products in O(1) optical latency via the broadcast-and-weight paradigm.
End-to-end NIAH evaluation confirms that MRR-selected block-sparse attention preserves full-attention accuracy from 4K to 64K tokens (within the model's native context window) under realistic hardware impairments (4–5-bit weights, 30 pm thermal drift), while reducing KV cache traffic by 16× at 64K context (k = 32, B = 128; 32× at 128K). At longer contexts (128K+), model-intrinsic accuracy degrades independently of block selection; the photonic scaling analysis nevertheless projects favorable energy and latency scaling to million-token regimes as model context windows continue to expand. The practical energy benefit emerges at n ≥ 4K, where block selection yields meaningful traffic reduction, making Prism favorable across virtually all practical context lengths. Future work will proceed along three axes: (i) fabrication and characterization of an 8×8 TFLN MRR weight bank to validate simulation predictions with measured device parameters; (ii) scaling to a full d = 64, N = 256 module integrated with GPU-based inference for end-to-end latency and energy measurements; and (iii) integration of non-volatile weight storage (e.g., phase-change trimming [60, 1]) for write-once signature programming, together with hardware-aware learned projections and broader benchmarks such as SCBench [27]. More broadly, photonic broadcast search may serve as a general paradigm for similarity-search workloads in data centers, including approximate nearest-neighbor retrieval, recommendation ranking, and embedding lookup, wherever a single query must be compared against a large, slowly changing set of stored vectors.

Disclosures

The authors declare no conflicts of interest.

Data Availability

Code and simulation data are available at https://github.com/hyoseokp/PRISM [40].

References

[1] U. Adya, S. Singhal, R. Chen, I. Chen, S. Joshi, A. Majumdar, M. Li, and S. Moazeni (2025) Non-volatile tuning of cryogenic silicon photonic micro-ring modulators.
Nature Communications 16, p. 9290.
[2] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai (2023) GQA: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), p. 4895–4901.
[3] Anthropic (2026) 1M context is now generally available for Opus 4.6 and Sonnet 4.6. https://claude.com/blog/1m-context-ga. 1M token context window at standard pricing; March 13, 2026.
[4] K. Aubrey (2026) Inside the NVIDIA Vera Rubin platform: six new chips, one AI supercomputer. NVIDIA Developer Blog.
[5] Y. Bai, X. Xu, M. Tan, Y. Sun, Y. Li, J. Wu, R. Morandotti, A. Mitchell, K. Xu, and D. J. Moss (2023) Photonic multiplexing techniques for neuromorphic computing. Nanophotonics 12 (5), p. 795–817.
[6] P. Behnam, Y. Fu, R. Zhao, P. Tsai, Z. Yu, and A. Tumanov (2025) RocketKV: accelerating long-context LLM inference via two-stage KV cache compression. In International Conference on Machine Learning (ICML).
[7] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets (2012) Silicon microring resonators. Laser & Photonics Reviews 6 (1), p. 47–73.
[8] J. Choquette (2023) NVIDIA Hopper H100 GPU: scaling performance. IEEE Micro 43 (3), p. 9–17.
[9] W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walmsley (2016) Optimal design for universal multiport interferometers. Optica 3 (12), p. 1460–1465.
[10] T. Dao (2024) FlashAttention-2: faster attention with better parallelism and work partitioning.
In International Conference on Learning Representations. arXiv:2307.08691.
[11] P. Dong (2016) Silicon photonic integrated circuits for wavelength-division multiplexing applications. IEEE Journal of Selected Topics in Quantum Electronics 22 (6), p. 370–378.
[12] F. Fayza, C. Demirkiran, S. P. Rao, D. Bunandar, U. Gupta, and A. Joshi (2025) Photonics for sustainable AI. Communications Physics 8.
[13] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran (2021) Parallel convolutional processing using an integrated photonic tensor core. Nature 589, p. 52–58.
[14] T. Ferreira de Lima, E. A. Doris, S. Bilodeau, W. Zhang, A. Jha, H. Peng, E. C. Blow, C. Huang, A. N. Tait, B. J. Shastri, and P. R. Prucnal (2022) Design automation of photonic resonator weights. Nanophotonics 11 (17), p. 3805–3822.
[15] T. Fu, J. Zhang, R. Sun, Y. Huang, W. Xu, S. Yang, Z. Zhu, and H. Chen (2024) Optical neural networks: progress and challenges. Light: Science & Applications 13.
[16] Gemini Team Google (2024) Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
[17] A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer (2024) AI and memory wall. IEEE Micro 44 (3), p. 33–39.
[18] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami (2024) KVQuant: towards 10 million context length LLM inference with KV cache quantization. In Advances in Neural Information Processing Systems (NeurIPS).
[19] Y. Hu, Y.
Song, X. Zhu, X. Guo, S. Lu, Q. Zhang, L. He, C. A. A. Franken, K. Powell, H. Warner, et al. (2025) Integrated lithium niobate photonic computing circuit based on efficient and high-speed electro-optic conversion. Nature Communications 16, p. 8178.
[20] S. Hua, E. Divita, S. Yu, B. Peng, C. Roques-Carmes, Z. Su, Z. Chen, Y. Bai, J. Zou, Y. Zhu, Y. Xu, C. Lu, Y. Di, H. Chen, L. Jiang, L. Wang, L. Ou, C. Zhang, J. Chen, W. Zhang, H. Zhu, W. Kuang, L. Wang, H. Meng, M. Steinman, and Y. Shen (2025) An integrated large-scale photonic accelerator with ultralow latency. Nature 640, p. 361–367.
[21] C. Huang, S. Bilodeau, T. F. de Lima, A. N. Tait, P. Y. Ma, E. C. Blow, A. Jha, H. Peng, B. J. Shastri, and P. R. Prucnal (2020) Demonstration of scalable microring weight bank control for large-scale photonic integrated circuits. APL Photonics 5 (4), p. 040803.
[22] P. Indyk and R. Motwani (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC), p. 604–613.
[23] J. Johnson, M. Douze, and H. Jégou (2021) Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7 (3), p. 535–547.
[24] W. B. Johnson and J. Lindenstrauss (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, p. 189–206.
[25] G. Kamradt (2023) Needle in a haystack — pressure testing LLMs. https://github.com/gkamradt/LLMTest_NeedleInAHaystack.
[26] H. Li, Y. Li, A. Tian, T. Tang, Z. Xu, X. Chen, N. Hu, W. Dong, Q. Li, and L. Chen (2025) A survey on large language model acceleration based on KV cache management.
Transactions on Machine Learning Research.
[27] Y. Li, H. Jiang, Q. Wu, X. Luo, S. Ahn, C. Zhang, A. H. Abdi, D. Li, J. Gao, Y. Yang, and L. Qiu (2025) SCBench: a KV cache-centric analysis of long-context methods. In International Conference on Learning Representations (ICLR).
[28] S. Lischke, A. Peczek, J. S. Morgan, K. Sun, D. Steckler, Y. Yamamoto, F. Korndörfer, C. Mai, S. Marschmeyer, M. Fraschke, A. Krüger, A. Beling, and L. Zimmermann (2021) Ultra-fast germanium photodiode with 3-dB bandwidth of 265 GHz. Nature Photonics 15, p. 925–931.
[29] G. Liu, C. Li, Z. Ning, J. Lin, Y. Yao, D. Ke, M. Guo, and J. Zhao (2025) FreeKV: boosting KV cache retrieval for efficient LLM inference. arXiv preprint arXiv:2505.13109.
[30] X. Liu, Z. Tang, P. Dong, Z. Li, B. Li, X. Hu, and X. Chu (2025) ChunkKV: semantic-preserving KV cache compression for efficient long-context LLM inference. In Advances in Neural Information Processing Systems (NeurIPS).
[31] C. D. Manning, P. Raghavan, and H. Schütze (2008) Introduction to information retrieval. Cambridge University Press.
[32] Meta AI (2026) The Llama 4 herd: the beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Llama 4 Scout: 17B active params, 10M token context window.
[33] NVIDIA (2023) NVIDIA H100 Tensor Core GPU datasheet. 80 GB HBM3, 3.35 TB/s memory bandwidth. https://www.nvidia.com/en-us/data-center/h100/.
[34] NVIDIA (2025) NVIDIA Blackwell B200 GPU. 192 GB HBM3e, 8 TB/s memory bandwidth. https://www.nvidia.com/en-us/data-center/dgx-b200/.
[35] NVIDIA (2026) Inference Context Memory Storage (ICMS): BlueField-4 DPU for LLM inference. Flash-backed KV cache with hardware-assisted eviction and prefetch.
Announced at CES 2026. https://nvidianews.nvidia.com/news/nvidia-bluefield-4-powers-new-class-of-ai-native-storage-infrastructure-for-the-next-frontier-of-ai.
[36] NVIDIA (2026) NVIDIA Vera Rubin platform. Announced at GTC 2026; next-generation AI platform with Rubin GPU, Vera CPU, HBM4, and integrated memory connectivity. https://nvidianews.nvidia.com/news/nvidia-vera-rubin-platform.
[37] NVIDIA (2027) NVIDIA Rubin Ultra GPU with HBM4e. 1 TB HBM4e, ∼32 TB/s bandwidth; announced at GTC 2025. https://videocardz.com/newz/nvidia-unveils-rubin-ultra-with-1tb-hbm4e-memory-for-2027-feynman-architecture-in-2028.
[38] OpenAI (2026) Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/. 1M token context window, native computer use; released March 5, 2026.
[39] K. Padmaraju and K. Bergman (2014) Resolving the thermal challenges for silicon microring resonator devices. Nanophotonics 3 (4–5), p. 269–281.
[40] H. Park and Y. Park (2025) PRISM: photonic retrieval-index similarity module — experiment code. https://github.com/hyoseokp/PRISM. Source code for retrieval-head identification, signature generation, recall measurement, hardware-aware simulation, and downstream evaluation.
[41] H. Park and Y. Park (2026) Photonic softmax with micro-ring resonator cascades: analytical exponential function from Lorentzian composition. arXiv preprint arXiv:2603.12934.
[42] N. Peserico, B. J. Shastri, and V. J. Sorger (2023) Integrated photonic tensor processing unit for a matrix multiply: a review. Journal of Lightwave Technology 41 (12), p. 3704–3716.
[43] Qwen Team (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
[44] Qwen Team (2025) Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383.
[45] Qwen Team (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[46] M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani (1994) Experimental realization of any discrete unitary operator. Physical Review Letters 73 (1), p. 58–61.
[47] T. J. Seok, K. Kwon, J. Henriksson, J. Luo, and M. C. Wu (2019) Wafer-scale silicon photonic switches beyond die size limit. Optica 6 (4), p. 490–494.
[48] B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. P. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal (2021) Photonics for artificial intelligence and neuromorphic computing. Nature Photonics 15 (2), p. 102–114.
[49] N. Shazeer (2019) Fast transformer decoding: one write-head is all you need. arXiv preprint arXiv:1911.02150.
[50] S. Shekhar, W. Bogaerts, L. Chrostowski, J. E. Bowers, M. Hochberg, R. Soref, and B. J. Shastri (2024) Roadmapping the next generation of silicon photonics. Nature Communications 15, p. 751.
[51] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić (2017) Deep learning with coherent nanophotonic circuits. Nature Photonics 11, p. 441–446.
[52] K. Staniszewski and A. Łańcucki (2026) KV cache transform coding for compact storage in LLM inference. In International Conference on Learning Representations (ICLR). arXiv:2511.01815.
[53] F. Sunny, A. Mirza, M. Nikdast, and S. Pasricha (2021) CrossLight: a cross-layer optimized silicon photonic neural network accelerator. In Proc. 58th ACM/IEEE Design Automation Conference (DAC), p. 1069–1074.
[54] A. N. Tait, T. F. de Lima, M. A. Nahmias, B. J. Shastri, and P. R.
Prucnal (2016) Microring weight banks. IEEE Journal of Selected Topics in Quantum Electronics 22 (6), p. 5900110.
[55] A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal (2017) Neuromorphic photonic networks using silicon photonic weight banks. Scientific Reports 7, p. 7430.
[56] A. N. Tait, M. A. Nahmias, B. J. Shastri, T. F. de Lima, and P. R. Prucnal (2014) Broadcast and weight: an integrated network for scalable photonic spike processing. Journal of Lightwave Technology 32 (21), p. 3427–3439.
[57] H. Tang, Y. Lin, J. Lin, Q. Han, S. Hong, Y. Yao, and G. Wang (2024) RazorAttention: efficient KV cache compression through retrieval heads. arXiv preprint arXiv:2407.15891.
[58] J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han (2024) Quest: query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774.
[59] Y. Tian, S. Xiang, X. Guo, Y. Zhang, J. Xu, S. Shi, H. Zhao, Y. Wang, X. Niu, W. Liu, and Y. Hao (2025) Photonic transformer chip: interference is all you need. PhotoniX 6, p. 45.
[60] B. Tossoun, D. Liang, S. Cheung, Z. Fang, X. Sheng, J. P. Strachan, and R. G. Beausoleil (2024) High-speed and energy-efficient non-volatile silicon photonic memory based on heterogeneously integrated memresonator. Nature Communications 15, p. 551.
[61] A. Totovic, G. Giamougiannis, A. Tsakyridis, D. Lazovsky, and N. Pleros (2022) Programmable photonic neural networks combining WDM with coherent linear optics. Scientific Reports 12, p. 5605.
[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I.
Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30, p. 5998–6008. Cited by: §I.1. [63] C. Wang, M. Zhang, X. Chen, M. Bertrand, A. Shams-Ansari, S. Chandrasekhar, P. Winzer, and M. Lončar (2018) Integrated lithium niobate electro-optic modulators operating at CMOS-compatible voltages. Nature 562, p. 101–104. External Links: Document Cited by: §I.3. [64] W. Wu, Y. Wang, G. Xiao, H. Peng, and Y. Fu (2025) Retrieval head mechanistically explains long-context factuality. In International Conference on Learning Representations (ICLR), Note: Oral presentation Cited by: §I, §V.3. [65] C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, S. Han, and M. Sun (2024) InfLLM: training-free long-context extrapolation for LLMs with an efficient context memory. arXiv preprint arXiv:2402.04617. Cited by: §I, §VII.2, §VII.3. [66] G. Xiao, J. Tang, J. Zuo, J. Guo, S. Yang, H. Tang, Y. Fu, and S. Han (2025) DuoAttention: efficient long-context LLM inference with retrieval and streaming heads. In International Conference on Learning Representations (ICLR), Cited by: §I, §I.2, §I.2, 1st item, §VII.2. [67] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024) Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), Note: arXiv:2309.17453 Cited by: §I. [68] Q. Xu, B. Schmidt, S. Pradhan, and M. Lipson (2005) Micrometre-scale silicon electro-optic modulator. Nature 435, p. 325–327. External Links: Document Cited by: §I.3. [69] H. Zhang, Y. Song, S. Chen, Y. Bai, X. Xu, C. Huang, J. Wang, H. Chen, D. J. Moss, and K. Xu (2025) Integrated platforms and techniques for photonic neural networks. npj Nanophotonics 2, p. 40. External Links: Document Cited by: §I.3, §VII.3. [70] W. Zhang, F. Yin, H. Yen, D. Chen, and X. Ye (2025) Query-focused retrieval heads improve long-context reasoning and re-ranking. arXiv preprint arXiv:2506.09944. Cited by: §VII.3. [71] Z. 
Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen (2023) H2O: heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §I. [72] H. Zhou, J. Dong, J. Cheng, W. Dong, C. Huang, Y. Shen, Q. Zhang, M. Gu, C. Qian, H. Chen, Z. Ruan, and X. Zhang (2022) Photonic matrix multiplication lights up photonic accelerator and beyond. Light: Science & Applications 11, p. 30. External Links: Document Cited by: §I.3. [73] H. Zhu, J. Gu, H. Wang, Z. Jiang, Z. Zhang, R. Tang, C. Feng, S. Han, R. T. Chen, and D. Z. Pan (2024) Lightening-Transformer: a dynamically-operated optically-interconnected photonic transformer accelerator. In Proc. IEEE International Symposium on High-Performance Computer Architecture (HPCA), p. 686–703. External Links: Document Cited by: §I, §I.3, §I.3. [74] X. Zhu, Y. Hu, S. Lu, H. K. Warner, X. Li, Y. Song, L. S. Magalhães, A. Shams-Ansari, N. Sinclair, and M. Lončar (2024) Twenty-nine million intrinsic Q-factor monolithic microresonators on thin-film lithium niobate. Photonics Research 12 (8), p. A7–A14. External Links: Document Cited by: §I.6, Table 1. Supplementary Information Appendix S1 Device Impairment Models This section provides the full mathematical models for the six impairment sources that degrade the ideal inner-product computation of eq. 6. Weight quantization. MRR transmission is programmed via electro-optic tuning with finite precision. We model the quantized weight as w^n,j=round(wn,j⋅2b)2b, w_n,j= round(w_n,j· 2^b)2^b, (S1) where b is the effective bit precision. Values of b=4b=4–88 are considered, corresponding to 16–256 distinguishable transmission levels. Thermal drift. After initial calibration, the MRR resonance wavelength drifts due to ambient temperature fluctuations. 
We model the drift as a Gaussian random walk:
$$\Delta\lambda_0(t) = \sum_{i=1}^{t/\Delta t} \mathcal{N}\left(0, \sigma_{\mathrm{drift}}^{2}\right), \tag{S2}$$
with $\sigma_{\mathrm{drift}}$ chosen to produce a standard deviation of 0.01 nm to 0.1 nm over a calibration interval $T_{\mathrm{cal}}$. The resulting weight error is
$$\Delta w = \left|\frac{\partial T}{\partial \lambda}\right| \Delta\lambda_0 \approx \frac{8 Q^{2} D_{\max}}{\lambda_0^{2}} \cdot \Delta\lambda_0, \tag{S3}$$
evaluated at the operating point on the MRR Lorentzian. Note that the i.i.d. Gaussian model above does not capture spatially correlated drift (e.g., center-to-edge temperature gradients across the chip), which could cause systematic bias in the inner-product scores rather than zero-mean random noise; such gradients would require a correlated noise model or per-region calibration.

Insertion loss. Each MRR introduces an off-resonance insertion loss $IL_{\mathrm{MRR}} \approx 0.02$–$0.05$ dB, and the $1 \times N$ splitter contributes $IL_{\mathrm{split}}$ from eq. 8. The total channel loss is
$$IL_{\mathrm{total}} = IL_{\mathrm{split}} + d \cdot IL_{\mathrm{MRR}} + IL_{\mathrm{wg}}, \tag{S4}$$
where $IL_{\mathrm{wg}}$ accounts for waveguide propagation loss ($\sim 0.3$ dB/cm for TFLN). High insertion loss reduces the SNR at the photodetector and increases the required laser power.

Photodetector noise. The photocurrent at each detector includes shot noise and thermal noise:
$$\sigma_I^{2} = 2 e I_{\mathrm{ph}}\, \Delta f + \frac{4 k_B T}{R_L}\, \Delta f + \mathrm{NEP}^{2} \cdot \Delta f, \tag{S5}$$
where $I_{\mathrm{ph}}$ is the signal photocurrent, $\Delta f$ is the detection bandwidth, $R_L$ is the load resistance, and NEP is the noise-equivalent power of the photodetector ($\sim 10\ \mathrm{pW}/\sqrt{\mathrm{Hz}}$ for Ge-on-Si) [8, 7]. The noise introduces a random perturbation to the inner-product score, potentially reordering the top-k ranking.

MRR crosstalk. Adjacent MRRs on the same bus waveguide can exhibit spectral overlap if the channel spacing is insufficient relative to the MRR linewidth.
We model inter-channel crosstalk as an additive interference with isolation of $-15$ dB to $-30$ dB:
$$y_n = \sum_{j=1}^{d} w_{n,j}\, s_j + \sum_{j=1}^{d} \sum_{m \neq j} \chi_{j,m}\, w_{n,m}\, s_m, \tag{S6}$$
where $\chi_{j,m}$ is the crosstalk coefficient from channel $m$ to channel $j$ [6].

Input DAC noise. The finite resolution and integral nonlinearity (INL) of the input DACs contribute an additional noise floor on the query sketch values. At $b_{\mathrm{DAC}} = 6$ bits, the quantization noise standard deviation is $\sigma_{\mathrm{DAC}} = 2^{-b_{\mathrm{DAC}}}/\sqrt{12} \approx 0.0045$.

Figure S1: Impact of weight quantization on recall. At $b = 6$ bits, recall degrades by less than 5% from the floating-point ideal (Recall@8 = 0.960 at 6-bit). Adding thermal drift ($\sigma_{\mathrm{th}} = 0.01$) and detector noise ($\sigma_{\mathrm{det}} = 0.01$) degrades recall by an additional 5%.

Figure S2: Recall degradation as a function of thermal drift $\sigma$. Recall remains above 95% for $\sigma \le 0.005$ (corresponding to $\sim 150$ pm drift), achievable with standard thermal stabilization. At $\sigma = 0.01$ ($\sim 300$ pm), recall is still 94.8%.

Figure S3: Weight encoding fidelity under MRR impairments ($d = 32$, $N = 128$). (a) Ideal vs. 5-bit quantized weights in $[0, 1]$: the staircase pattern shows 32 discrete levels with RMSE $= 0.009$. (b) Adding 20 pm thermal drift and fabrication variation broadens the scatter (RMSE $= 0.014$). (c) Error histograms for three configurations: 5-bit only, 5-bit with 20 pm drift, and 4-bit with 30 pm drift. Even the pessimistic case concentrates errors within $\pm 5$% of the full weight range.

Figure S4: Recall@8 degradation as a function of photodetector noise $\sigma_{\mathrm{det}}$ ($d = 32$, $N = 500$, 100 trials). Recall remains above the 95% threshold for $\sigma_{\mathrm{det}} \le 0.01$, achievable with standard Ge-on-Si photodetectors (NEP $\sim 1\ \mathrm{pW}/\sqrt{\mathrm{Hz}}$).

Figure S5: NIAH accuracy heatmap across context lengths (4K–128K) and MRR configurations.
At $k = 32$, all MRR variants match full attention perfectly up to 64K. Under the $k = 8$ stress test, accuracy degrades gracefully with context length but remains above 90% at 16K. The 128K column shows model-intrinsic degradation (full attention itself drops to 45.5%).

Figure S6: Signed vs. unsigned recall comparison. Balanced photodetection (signed $[-1, +1]$) consistently outperforms ReLU projection (unsigned $[0, 1]$) and split encoding across all signature dimensions $d$.

Appendix S2 Optical Link Budget Details

Table S1: Optical link budget for $d = 32$, $N = 256$ ($P_{\mathrm{laser}} = 20$ dBm).

Element | Loss (dB) | Cumulative (dBm)
Laser output | — | +20.0
Fiber–chip coupling | −2.0 | +18.0
MZM modulator (avg.) | −3.0 | +15.0
1×256 splitter | −25.7 (a) | −10.7
Waveguide (2 cm) | −1.0 | −11.7
d = 32 MRR (balanced, worst-case drop) | −3.2 (b) | −14.9
Chip–detector coupling | −1.0 | −15.9
Per-PD optical power | | −15.9 dBm

(a) $10 \log_{10}(256) + 0.2 \times 8 = 24.1 + 1.6 = 25.7$ dB.
(b) Drop-port 0.1 dB + $31 \times 0.05$ dB through-port = 1.65 dB; rounded to 3.2 dB with alignment margin [3].

Appendix S3 Latency Breakdown

The PRISM latency is the sum of the five pipeline stages:
$$t_{\mathrm{PRISM}} = t_{\mathrm{DAC}} + t_{\mathrm{opt}} + t_{\mathrm{PD}} + t_{\mathrm{ADC}} + t_{\mathrm{top}\text{-}k}, \tag{S7}$$
where the optical propagation time $t_{\mathrm{opt}}$ includes the modulator response, waveguide transit, and MRR ring-down time.

Table S2: PRISM latency breakdown.

Stage | Latency | Notes
DAC | ~1 ns | 4-bit
MZM | ~0.1 ns | Si depl.
Opt. prop. | ~0.5 ns | 5 cm
MRR decay | ~0.1 ns | Q = 10^4
PD | ~0.2 ns | Ge
TIA+ADC | ~2 ns | 6-bit flash
Top-k | ~5 ns | CMOS
Total | ~9 ns |

Appendix S4 Retrieval Head Detailed Results

Table S3: Retrieval head fraction at threshold $\tau = 0.3$ across context lengths. $R_h(\tau)$: percentage of KV heads with $R_h > \tau$. Mean $\bar{R}_h$: average retrieval ratio across all heads.

Model | Context | $R_h(\tau{=}0.3)$ (%) | Heads | Mean $\bar{R}_h$
Qwen2.5-7B | 2K | 83.9 | 94/112 | 0.574
 | 4K | 83.0 | 93/112 | 0.560
 | 8K | 91.1 | 102/112 | 0.627
 | 16K | 92.9 | 104/112 | 0.639
 | 32K | 95.5 | 107/112 | 0.656
 | 65K | 92.0 | 103/112 | 0.633
 | 128K | 98.2 | 110/112 | 0.796
Qwen3-8B | 2K | 86.5 | 250/288 | 0.626
 | 4K | 88.2 | 254/288 | 0.652
 | 8K | 89.6 | 258/288 | 0.657

Appendix S5 Crossover Derivation

This section provides the full algebraic derivation of the energy crossover point $n^*$ summarized in Sec. VI.5. The total decode cost per token for a retrieval head consists of two terms:
$$C_{\mathrm{total}} = C_{\mathrm{select}}(n) + C_{\mathrm{fetch}}(n, k), \tag{S8}$$
where $C_{\mathrm{select}}$ is the cost of determining which blocks to fetch and $C_{\mathrm{fetch}}$ is the cost of reading and computing attention over the selected blocks.

For the GPU full-scan baseline, no selection is needed ($k = N = n/B$), so the energy cost of reading all $n$ keys is
$$C_{\mathrm{GPU}} = C_{\mathrm{fetch}}^{\mathrm{GPU}}(n, n/B) = 2\, d_h\, n\, b_{\mathrm{prec}} \cdot E_{\mathrm{byte}}, \tag{S9}$$
where $E_{\mathrm{byte}} \approx 31$ pJ/B is the energy per byte for HBM3 memory access (H100 GPU) [4].

For PRISM, the cost is
$$C_{\mathrm{PRISM}} = C_{\mathrm{select}}^{\mathrm{PRISM}}(n) + C_{\mathrm{fetch}}^{\mathrm{GPU}}(n, k \cdot B), \tag{S10}$$
where $C_{\mathrm{select}}^{\mathrm{PRISM}}$ is the dynamic energy of the photonic evaluation (laser, DAC, modulator, PD, ADC):
$$C_{\mathrm{select}}^{\mathrm{PRISM}} \approx E_{\mathrm{dynamic}}. \tag{S11}$$
Note that on the TFLN platform, the static MRR tuning power is near zero (capacitive EO tuning), so the selection cost is dominated entirely by the dynamic components ($\sim 2290$ pJ per query, table 3). The fetch cost is reduced by the selection ratio $k/N$:
$$C_{\mathrm{fetch}}^{\mathrm{GPU}}(n, k \cdot B) = 2\, d_h\, k\, B\, b_{\mathrm{prec}} \cdot E_{\mathrm{byte}}. \tag{S12}$$
The crossover occurs when $C_{\mathrm{PRISM}} < C_{\mathrm{GPU}}$, i.e., when the memory-access energy saved by not fetching the $(N - k)$ unselected blocks exceeds the energy cost of operating the photonic selector.
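The cost model of eqs. (S8)–(S12) can be checked numerically. A minimal sketch, in which only $E_{\mathrm{byte}} \approx 31$ pJ/B and $E_{\mathrm{dynamic}} \approx 2290$ pJ are taken from the text; the head dimension, precision, and block size are hypothetical values chosen for illustration:

```python
# Sketch of the per-token decode energy model, eqs. (S8)-(S12).
D_H = 128             # head dimension (assumed)
B_PREC = 2            # bytes per element, FP16 (assumed)
BLOCK = 128           # tokens per KV block (assumed)
E_BYTE = 31e-12       # J/byte, HBM3 access energy (H100)
E_DYNAMIC = 2290e-12  # J/query, photonic selection (laser+DAC+MZM+PD+ADC)

def c_gpu(n: int) -> float:
    """Full-scan baseline (eq. S9): read all n keys and values."""
    return 2 * D_H * n * B_PREC * E_BYTE

def c_prism(n: int, k: int = 32) -> float:
    """Photonic selection plus fetch of k blocks (eqs. S10-S12)."""
    return E_DYNAMIC + 2 * D_H * k * BLOCK * B_PREC * E_BYTE

# The GPU cost grows linearly with n while the PRISM cost is O(1) in n,
# giving roughly a 16x reduction at a 64K context with k = 32.
ratio = c_gpu(65536) / c_prism(65536)
```

Under these assumptions PRISM wins whenever $n$ exceeds the fetched span $k \cdot B$: the selection energy ($\sim 2.3$ nJ) is far below the cost of reading even one extra KV block ($\sim 2\,\mu$J).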
Setting $C_{\mathrm{PRISM}} = C_{\mathrm{GPU}}$ and solving for $n$ yields:
$$n^* = \frac{C_{\mathrm{select}}^{\mathrm{PRISM}}}{2\, d_h\, b_{\mathrm{prec}}\, E_{\mathrm{byte}}\, \left(1 - kB/n^*\right)}, \tag{S13}$$
which must be solved self-consistently since $n^*$ appears on both sides through the selection ratio $kB/n^*$. In practice, $k$ is a fixed parameter (e.g., $k = 32$), so the selection ratio $kB/n \to 0$ as $n \to \infty$, and the crossover simplifies to
$$n^* \approx \frac{C_{\mathrm{select}}^{\mathrm{PRISM}}}{2\, d_h\, b_{\mathrm{prec}}\, E_{\mathrm{byte}}}. \tag{S14}$$

Appendix S6 WDM Channel Limits

The maximum number of WDM channels $d$ is constrained by the MRR free spectral range (FSR), the available optical bandwidth, and inter-channel crosstalk.

Single-FSR constraint. For a TFLN MRR with radius $R = 20\ \mu$m, the FSR is approximately 8.3 nm (at $\lambda_0 = 1550$ nm, as in table 1). At a channel spacing of $\Delta\lambda_{\mathrm{ch}} = 1.6$ nm (200 GHz on the ITU grid) [5], the maximum number of non-aliased channels within one FSR is
$$d_{\max}^{(1\text{-FSR})} = \left\lfloor \frac{\mathrm{FSR}}{\Delta\lambda_{\mathrm{ch}}} \right\rfloor = \left\lfloor \frac{8.3}{1.6} \right\rfloor = 5. \tag{S15}$$
This is clearly insufficient for the $d = 32$–$128$ range targeted by PRISM.

FSR extension techniques. Vernier-coupled dual-ring filters or cascaded MRRs with slightly different radii can extend the effective FSR to $\sim 50$ nm or more [9, 2], limited by the least common multiple of the individual FSRs. With an extended FSR of 50 nm and $\Delta\lambda_{\mathrm{ch}} = 1.6$ nm:
$$d_{\max}^{(\mathrm{Vernier})} = \left\lfloor \frac{50}{1.6} \right\rfloor = 31. \tag{S16}$$

Band-limited operation. The usable optical bandwidth depends on the operating band:
• C-band (1530–1565 nm): 35 nm → practical limit $d \sim 20$–$30$ without FSR extension;
• C+L band (1530–1625 nm): 95 nm → $d \sim 60$.
For $d > 60$, extending to the S-band or using 0.8 nm (100 GHz) channel spacing is necessary, at the cost of tighter crosstalk margins. In practice, achieving $d = 128$ requires both FSR extension and C+L+S-band operation, representing a more aggressive photonic design point.
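The channel-count limits above reduce to one floor division once the FSR is known. A minimal sketch; the TFLN group index $n_g \approx 2.3$ used to recover the quoted 8.3 nm FSR is an assumed value, not stated in the text:

```python
import math

def fsr_nm(radius_um: float, wavelength_nm: float = 1550.0, n_g: float = 2.3) -> float:
    """MRR free spectral range: FSR = lambda^2 / (n_g * 2*pi*R).
    The TFLN group index n_g ~ 2.3 is an assumed value."""
    circumference_nm = 2 * math.pi * radius_um * 1e3  # um -> nm
    return wavelength_nm ** 2 / (n_g * circumference_nm)

def max_channels(fsr: float, spacing_nm: float = 1.6) -> int:
    """Non-aliased WDM channels within one (possibly extended) FSR, eqs. (S15)-(S16)."""
    return math.floor(fsr / spacing_nm)

print(round(fsr_nm(20.0), 1))  # 8.3 nm for R = 20 um
print(max_channels(8.3))       # 5 channels within a single FSR
print(max_channels(50.0))      # 31 with a Vernier-extended FSR
```

The same helper reproduces the C+L band estimate: a 95 nm window at 1.6 nm spacing gives 59 channels, consistent with the $d \sim 60$ figure.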
S6.1 Balanced Photodetection for Signed Weights

The add-drop MRR configuration provides both through-port transmission $T_{\mathrm{through}}(\Delta\lambda)$ and complementary drop-port transmission $T_{\mathrm{drop}}(\Delta\lambda) = 1 - T_{\mathrm{through}}(\Delta\lambda)$ (for a lossless ring). A balanced photodetector pair measures the differential signal:
$$w = T_{\mathrm{through}} - T_{\mathrm{drop}} = 2\, T_{\mathrm{through}} - 1 \in [-1, +1]. \tag{S17}$$
This mapping naturally encodes signed weights without doubling the MRR count (as required by split encoding) or discarding sign information (as in ReLU projection). The noise model for balanced detection yields shot-noise variance $\sigma^2 = 2 e \mathcal{R} P_0\, \Delta f$, independent of the programmed weight, since power is conserved: $P_{\mathrm{through}} + P_{\mathrm{drop}} = P_0$ [8].

Appendix S7 Multi-Head Serving with GQA

A natural concern is whether serving all retrieval heads requires replicating the weight bank for each head. In Qwen2.5-7B, 102 out of 112 KV heads are retrieval heads at 8K context (table S3). However, GQA [1] reduces the number of independent KV heads to $H_{KV} = 4$; each KV head is shared across $H/H_{KV} = 7$ query heads. The weight bank stores block signatures derived from key vectors, so it operates at the KV-head granularity, not the query-head granularity. This means only 4 independent weight-bank configurations are needed per layer, not 102.

Two strategies can serve these 4 KV heads:
1. Time-multiplexed reuse. A single weight bank is reprogrammed sequentially for each of the 4 KV heads. On the TFLN platform, EO tuning settles in $\sim 1$ ns (RC-limited), so reprogramming 4 heads adds only $\sim 4$ ns, negligible compared to the subsequent KV block fetch ($\sim 5\ \mu$s).
2. Parallel replication. Four weight banks are deployed in parallel, one per KV head. This requires $4 \times d \times N = 4 \times 64 \times 1024 = 262{,}144$ MRRs total, a $4\times$ increase over the single-head case but within the scalability limits discussed in Sec. VII.1.
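The two serving strategies trade hardware for a small reprogramming delay. A minimal sketch of the bookkeeping, using the $d = 64$, $N = 1024$ bank size and the $\sim 1$ ns settling / $\sim 5\ \mu$s fetch figures quoted above:

```python
# Bookkeeping for the two GQA serving strategies in Appendix S7.
D, N = 64, 1024    # signature dimension, blocks per weight bank
T_SETTLE = 1e-9    # EO reprogramming time per KV head (TFLN, RC-limited)
T_FETCH = 5e-6     # subsequent KV block fetch

def mrrs_parallel(n_kv_heads: int) -> int:
    """Parallel replication: one d x N weight bank per KV head."""
    return n_kv_heads * D * N

def mux_overhead(n_kv_heads: int) -> float:
    """Time-multiplexed reuse: reprogramming delay as a fraction of the fetch."""
    return n_kv_heads * T_SETTLE / T_FETCH

print(mrrs_parallel(4))  # 262144 MRRs for Qwen2.5-7B (H_KV = 4)
print(mux_overhead(4))   # ~0.0008, i.e. under 0.1% of the fetch time
```

For Qwen3-8B ($H_{KV} = 8$) the same helpers double the MRR count while the multiplexing overhead stays well under 0.1%.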
GQA thus reduces the multi-head serving problem from 102 independent banks (one per retrieval head) to just 4, making both time-multiplexed and spatially parallel approaches practical. For Qwen3-8B ($H_{KV} = 8$), the same argument applies with 8 KV heads, still far below the 258 retrieval heads.

References

[1] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai (2023) GQA: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), p. 4895–4901.
[2] R. Boeck, N. A. F. Jaeger, N. Rouger, and L. Chrostowski (2010) Series-coupled silicon racetrack resonators and the Vernier effect: theory and measurement. Optics Express 18 (24), p. 25151–25157.
[3] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets (2012) Silicon microring resonators. Laser & Photonics Reviews 6 (1), p. 47–73.
[4] J. Choquette (2023) NVIDIA Hopper H100 GPU: scaling performance. IEEE Micro 43 (3), p. 9–17.
[5] International Telecommunication Union (2020) Spectral grids for WDM applications: DWDM frequency grid. ITU-T Recommendation G.694.1.
[6] H. Jayatilleka, K. Murray, M. Caverley, N. A. F. Jaeger, L. Chrostowski, and S. Shekhar (2016) Crosstalk in SOI microring resonator-based filters. Journal of Lightwave Technology 34 (12), p. 2886–2896.
[7] S. D. Personick (1973) Receiver design for digital fiber optic communication systems, I. Bell System Technical Journal 52 (6), p. 843–874.
[8] B. E. A. Saleh and M. C. Teich (2019) Fundamentals of Photonics. 3rd edition, John Wiley & Sons.
[9] A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal (2017) Neuromorphic photonic networks using silicon photonic weight banks. Scientific Reports 7, p. 7430.