Paper deep dive

PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection

Hyoseok Park, Yeonsang Park

Year: 2026 | Venue: arXiv preprint | Area: physics.optics | Type: Preprint | Embeddings: 110

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/26/2026, 1:32:44 AM

Summary

PRISM is a photonic similarity engine designed to overcome the O(n) memory bandwidth bottleneck in long-context LLM inference. By utilizing a broadcast-and-weight photonic architecture on thin-film lithium niobate (TFLN), PRISM performs O(1) similarity searches for KV cache block selection, significantly reducing memory traffic and energy consumption compared to electronic GPU baselines.

Entities (5)

PRISM · systemarchitecture · 100%
Broadcast-and-weight · computing-paradigm · 95%
KV Cache · memory-component · 95%
Qwen2.5-7B · llm-model · 95%
TFLN · material · 95%

Relation Signals (3)

PRISM → accelerates → KV cache block selection

confidence 95% · PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine.

KV Cache → bottlenecks → LLM inference

confidence 95% · Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache

PRISM → uses → TFLN

confidence 95% · a thin-film lithium niobate (TFLN) similarity engine.

Cypher Suggestions (2)

Find all systems that accelerate KV cache operations · confidence 90% · unvalidated

MATCH (s:System)-[:ACCELERATES]->(t:Task {name: 'KV cache block selection'}) RETURN s.name

Map the relationship between hardware architectures and their underlying materials · confidence 85% · unvalidated

MATCH (s:System)-[:USES]->(m:Material) RETURN s.name, m.name

Abstract

Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step -- a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm -- the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).


Full Text

109,560 characters extracted from source content.


PRISM: Photonic Similarity Engine for KV Cache Block Selection in Long-Context LLM Inference

Hyoseok Park
Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea

Yeonsang Park (corresponding author; yeonsang.park@cnu.ac.kr)
Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea

Abstract

Long-context LLM inference is bottlenecked not by compute but by the memory bandwidth required to scan the KV cache at every decode step, a cost that grows linearly with context length. The semiconductor industry increasingly acknowledges this shift: NVIDIA’s Vera Rubin architecture dedicates an entire DPU (ICMS) to KV cache management with flash-backed storage and hardware-assisted prefetch, an architectural bet confirming that memory, not arithmetic, is the first-class constraint. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation. However, these approaches inherit the same $O(n)$ memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm: the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4–6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains $O(1)$. We instantiate this insight in Prism, a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at $k=32$, with $32\times$ traffic reduction.
Prism achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths ($n \geq 4$K).

I Introduction

The dominant cost of large language model (LLM) inference is no longer floating-point arithmetic. As autoregressive decoding generates one token at a time, each step requires reading the full key–value (KV) cache accumulated over all previous tokens, computing attention scores, and writing the result back. For a model with L layers and H attention heads, each storing key and value vectors of dimension $d_h$, the KV cache holds $2LHd_h$ elements per token (two bytes each at half precision), growing linearly with context length n. At $n = 128\,000$ tokens, a 70-billion-parameter model’s KV cache can exceed 40 GB, comparable to the entire model weight footprint, and the memory bandwidth required to stream this cache at every decode step far exceeds the compute throughput of modern GPUs [1].

This memory wall is intensifying [2]. Context windows are expanding aggressively: GPT-4 [3] and Gemini [4] pushed context to 128K tokens, Llama 3.1 supports $128\,000$ tokens [5], Qwen2.5 extends to one million [6], and multi-agent and retrieval-augmented generation (RAG) workloads routinely concatenate documents into contexts of hundreds of thousands of tokens. NVIDIA’s response in its Vera Rubin architecture is telling: the Intelligent Connectivity and Memory Switch (ICMS), built on the BlueField-4 data processing unit (DPU), adds a flash-based KV cache tier that can hold terabytes of context, together with hardware-assisted eviction and prefetch logic [7, 8]. This architectural bet confirms that KV cache management is now a first-class system design problem.

Photonic circuits on thin-film lithium niobate (TFLN) offer a set of physical properties that are uniquely matched to this bottleneck.
A wavelength-division-multiplexed (WDM) laser comb encodes a d-dimensional vector onto d co-propagating wavelengths in a single waveguide; a $1 \times N$ passive splitter then broadcasts identical copies of that vector to N output channels with no additional energy cost beyond splitting loss. At each channel, a bank of microring resonators (MRRs), one per wavelength, applies programmable transmission weights, and a broadband photodetector integrates over all wavelengths, yielding the analog inner product in a single optical transit (~10 ps per mm). The entire $d \times N$ matrix–vector product thus completes in $O(1)$ latency, with energy scaling dominated by weight-programming overhead rather than memory-access energy. This broadcast-and-weight paradigm [9, 10] converts the memory-bandwidth-bound electronic problem into an optically parallel computation.

Existing demonstrations of photonic neural-network accelerators have focused almost exclusively on dense matrix–vector multiplication for inference in convolutional and fully connected networks [11, 12, 13, 14]. In particular, Tian et al. demonstrated a photonic transformer chip (PTC) that implements full attention via coherent optical interference with runtime-programmable Mach–Zehnder meshes [13]; however, full-attention photonic computation faces the same $O(n)$ memory scaling as electronic attention when applied to long contexts. In contrast, the coarse block-selection step in KV cache retrieval is not a dense neural-network layer. It is a large-scale similarity search: a single query vector must be compared against $N \sim 10^3$–$10^4$ stored signatures, and only the top-k matches are needed.
This search problem has three properties that make it a more natural fit for the broadcast-and-weight architecture than general-purpose matrix multiplication: (i) the query is broadcast identically to all channels, perfectly matching the optical fan-out; (ii) the stored signatures are quasi-static (updated every 64–512 tokens), so MRR weights can be programmed via electro-optic tuning (Pockels effect); and (iii) only rank order matters, relaxing the precision requirement to 4–6 bits. We therefore propose the concept of a photonic broadcast search: an application of photonic broadcast-and-weight hardware not as a general neural-network accelerator, but as a specialized similarity engine for memory-intensive search tasks.

A crucial observation simplifies the problem: not all attention heads need the full cache. Recent work on retrieval heads [15, 16, 17] has shown that attention heads split into two categories: retrieval heads that attend to tokens far from the current position, and streaming heads that attend primarily to nearby tokens and “attention sinks.” The fraction classified as retrieval heads is threshold-dependent: DuoAttention identifies approximately 25% of heads as retrieval heads in MHA models and approximately 50% in GQA models via learned gating optimization [16], while our profiling on Qwen2.5-7B finds over 90% at a relaxed threshold ($\tau = 0.3$; Sec. V). This discrepancy reflects differing identification criteria rather than a contradiction; the key insight is that only the retrieval subset requires distant block fetches. This asymmetry has motivated a family of block-level selection methods that implement a coarse candidate selection step followed by fine attention over only the selected blocks on the GPU [18, 19, 20, 21, 22]. Complementary strategies include token-level eviction (H2O [23], StreamingLLM [24, 25]), two-stage coarse–fine retrieval (RocketKV [19]), and hardware-assisted caching (NVIDIA ICMS [8]).
All electronic approaches share a common limitation: the coarse selection step itself consumes memory bandwidth proportional to the number of stored blocks. Recent analysis confirms that this block selection phase can consume the majority of total KV retrieval latency [26]. A photonic inner-product engine can break this scaling by performing all N similarity evaluations in parallel, using wavelength multiplexing to avoid the sequential memory access pattern entirely.

We propose Prism (Photonic Ranking via Inner-product Similarity with Microring weights), a TFLN photonic similarity engine that realizes the photonic broadcast search concept for KV cache block selection. Prism encodes the query sketch onto d WDM wavelength channels, broadcasts it to N parallel MRR weight-bank channels via a $1 \times N$ optical splitter, and computes all N similarity scores, each as an analog optical dot product $I_n \propto \sum_j w_{n,j}\, s_j$, in $O(1)$ optical latency. A compact electronic top-k comparator selects the highest-scoring block indices, and only the corresponding KV blocks are fetched from memory. Figure 1 contrasts the conventional electronic full-scan approach with the Prism photonic block-selection pipeline.

Figure 1: Conceptual comparison of KV cache access strategies. Left: electronic GPU full scan. The processor sequentially reads all N KV blocks from HBM to compute attention, bottlenecked by memory bandwidth. Right: Prism photonic block selection. The query is broadcast optically to all N signature channels in parallel; only the top-k highest-scoring blocks are fetched from memory, reducing traffic by a factor of $N/k$.

Our contributions are as follows:

1. Photonic broadcast search architecture. We propose and analyze a photonic similarity engine based on the broadcast-and-weight paradigm, specifically designed for the KV cache block-selection task.
We present a complete optical power budget analysis covering splitting loss, MRR insertion loss, and photodetector noise floors, and derive the signal-to-noise ratio (SNR) requirements for reliable top-k ranking (Sec. III and Sec. IV).

2. Hardware-aware impairment modeling and NIAH validation. We build a device-level impairment model incorporating weight quantization (4–8 bits), residual thermal drift, insertion loss chains, photodetector noise, and MRR crosstalk, and show that recall degrades by less than 10% under realistic conditions. End-to-end needle-in-a-haystack (NIAH) evaluation with Qwen2.5-7B demonstrates that MRR-selected block-sparse attention matches full-attention accuracy at context lengths from 4K to 64K tokens (within the model’s native context window), while replacing the electronic selection with photonic $O(1)$-latency computation. Beyond 64K, model-intrinsic accuracy degrades independent of block selection (Sec. IV).

3. Photonic scaling analysis. We derive energy and latency models for Prism and electronic baselines (GPU full scan, GPU ANN, NVIDIA ICMS), identifying the context-length crossover point above which Prism is favorable, and analyze how the photonic architecture scales to million-token contexts (Sec. VI).

4. Retrieval head analysis and signature design. We systematically profile retrieval-head ratios across Qwen2.5-7B and Qwen3-8B, confirming that over 90% of KV heads are retrieval heads (at threshold $\tau = 0.3$), and evaluate block-level signatures, demonstrating that mean-key projection achieves 77.3% recall@8 with $d = 32$ (Sec. V).

II Background

II.1 KV Cache in Transformer Inference

The core of modern LLMs is the multi-head self-attention mechanism [27].
Given an input sequence of n tokens embedded as $X \in \mathbb{R}^{n \times d_{\mathrm{model}}}$, each attention head h in layer $\ell$ projects the input into queries, keys, and values:

$$Q^{(\ell,h)} = X\,W_Q^{(\ell,h)}, \qquad K^{(\ell,h)} = X\,W_K^{(\ell,h)}, \qquad V^{(\ell,h)} = X\,W_V^{(\ell,h)}, \tag{1}$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d_{\mathrm{model}} \times d_h}$ and $d_h = d_{\mathrm{model}}/H$ is the per-head dimension. The attention output is computed as

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q\,K^T}{\sqrt{d_h}}\right) V. \tag{2}$$

During the autoregressive decode phase, the model generates one token at a time. At step t, only the new query vector $q_t \in \mathbb{R}^{d_h}$ is computed, but the attention score requires the inner product of $q_t$ with all t previously cached key vectors:

$$\alpha_{t,i} = \frac{q_t \cdot k_i}{\sqrt{d_h}}, \qquad i = 1, \ldots, t. \tag{3}$$

The KV cache stores $K^{(\ell,h)}$ and $V^{(\ell,h)}$ for all layers and heads, consuming memory

$$M_{KV} = 2\,L\,H_{KV}\,d_h\,n\,b_{\mathrm{prec}}, \tag{4}$$

where $H_{KV}$ is the number of KV heads (which equals H for multi-head attention but is reduced under grouped-query attention, GQA [28, 29]) and $b_{\mathrm{prec}}$ is the byte width per element (2 for BF16). For Llama-3.1-8B ($L = 32$, $H_{KV} = 8$ with 4-group GQA, $d_h = 128$) at $n = 128\,000$, eq. 4 gives $M_{KV} \approx 16$ GB, which already consumes a substantial fraction of GPU HBM and grows linearly with n.

Crucially, the decode phase is memory-bandwidth-bound: each generated token requires reading the entire KV cache but performs only $O(n \cdot d_h)$ multiply-accumulate operations per head. The arithmetic intensity (FLOPs per byte) is $1/(2d_h) \ll 1$, far below the compute-to-bandwidth ratio of modern GPUs (50 FLOP/B to 200 FLOP/B), leaving the compute units idle while waiting for data [1].

II.2 Retrieval Heads and Selective Attention

The observation that not all attention heads require the full KV cache was formalized by DuoAttention [16] and RazorAttention [17].
These works define a retrieval ratio $R_h^{(\ell,h)}$ for each head as the fraction of attention mass that falls outside a local window of size w:

$$R_h^{(\ell,h)} = 1 - \frac{1}{T} \sum_{t=1}^{T} \sum_{i=\max(1,\,t-w)}^{t} \alpha_{t,i}^{(\ell,h)}, \tag{5}$$

where $\alpha_{t,i}^{(\ell,h)}$ is the attention weight from eq. 3 and T is the total sequence length of a calibration corpus. Heads with $R_h > \tau$ (typically $\tau \approx 0.1$) are classified as retrieval heads; the rest are streaming heads. Empirically, DuoAttention identifies approximately 25% (MHA) to 50% (GQA) of heads as retrieval heads via learned gating optimization [16]. Streaming heads can be served with a small sliding-window cache (e.g., $w = 256$), drastically reducing their memory footprint. However, retrieval heads still require access to the full context, making their KV traffic the dominant bottleneck.

II.3 Photonic Similarity Engine

As noted in Sec. I, the coarse block-selection step is a similarity search whose properties (identical query fan-out, quasi-static weights, and rank-order-only output) make it a natural fit for photonic broadcast-and-weight hardware. We now review the key photonic concepts underlying this match.

Broadcast-and-weight architecture. Tait et al. [9, 10] introduced the broadcast-and-weight (B&W) paradigm for neuromorphic photonic networks. In this architecture, d input signals are encoded on distinct wavelengths $\lambda_1, \ldots, \lambda_d$ and broadcast via a $1 \times N$ optical splitter to N output channels. Each output channel contains d microring resonators (MRRs), each tuned to one wavelength, whose transmission coefficients serve as programmable weights $w_{n,j}$ for channel n and wavelength j. A wavelength-insensitive photodetector at each output integrates over all wavelengths, yielding the photocurrent:

$$I_n = \mathcal{R}\,P_0 \sum_{j=1}^{d} w_{n,j}\, s_j, \tag{6}$$

where $\mathcal{R}$ is the detector responsivity, $P_0$ the per-channel optical power after splitting, and $s_j$ the query signal on wavelength $\lambda_j$.
The photocurrent $I_n$ is thus proportional to the inner product $w_n \cdot s$, which is precisely the similarity score between stored signature n and the broadcast query. This operation completes in a single optical transit time (~10 ps per mm), independent of d and N (up to splitting loss limits).

WDM spectral encoding. The query vector is encoded in the spectral domain: each component $s_j$ modulates the optical power on a dedicated wavelength channel $\lambda_j$, so the full d-dimensional vector propagates as a single multi-wavelength beam in one waveguide. This spectral encoding is distinct from spatial encoding, where each component occupies a separate waveguide, because it enables the key broadcast step (splitting one waveguide into N copies) with no additional multiplexing hardware. Channel spacings of 0.8 nm to 1.6 nm within the C-band support $d = 32$–$128$ channels using standard dense WDM (DWDM) laser combs and MRR filter banks.

Comparison with other photonic paradigms. Alternative photonic architectures, namely MZI meshes [30, 31, 32, 14] and coherent processors, require $O(d^2)$ elements or global phase stability, and do not naturally support the one-to-many fan-out needed for similarity search. The broadcast-and-weight paradigm uses incoherent intensity-domain processing, where each MRR operates independently and the photodetector sums power rather than field amplitude, eliminating the need for global phase coherence and making it uniquely suited to the block-selection task.

MRR weight banks. Each output channel employs d microring resonators whose electro-optically tunable transmission implements programmable weights $w \in [0, 1]$. The MRR physics and TFLN-specific device parameters are detailed in Secs. III.4 and III.6.

WDM-based matrix–vector multiplication.
Scalable MRR weight banks with up to 16 wavelength channels and ~7-bit precision have been demonstrated [33, 34, 35], and recent large-scale photonic accelerators validate integration beyond $16\,000$ components [11, 12, 14, 36]. The key advantage of this photonic approach for the KV cache selection problem is that the “weight matrix” (the collection of block signatures) is quasi-static and can be programmed into MRR resonances via electro-optic tuning, while the “input vector” (the query sketch) changes at every decode step but is broadcast optically to all N channels simultaneously. This decoupling of weight programming rate from inference rate is what enables the $O(1)$ latency scaling that electronic approaches cannot match. While Lightening-Transformer [12] targets full attention computation, Prism takes a complementary approach: accelerating only the lightweight block-selection ranking task, which requires lower precision and fewer channels, making the photonic implementation more practical.

III Photonic Retrieval Architecture

III.1 System Overview

Prism is a photonic similarity engine that sits between the KV cache storage (HBM or flash-backed ICMS) and the GPU’s attention compute units. It does not replace any part of the GPU pipeline; rather, it acts as a photonic broadcast search module that determines which KV cache blocks should be fetched for each retrieval head at each decode step.

Figure 2: Prism system architecture (five-stage pipeline). Stage 1 (Query Encoding): The GPU/ASIC computes the query sketch $q = [q_1, \ldots, q_d]$ and encodes each component onto a WDM wavelength via DAC-driven modulators, producing a WDM query signal where $P(\lambda_j) = q_j$. Stage 2 (Broadcast): A $1 \times N$ optical splitter distributes identical copies of the d-wavelength signal to all N signature channels (splitting loss: $-10 \log_{10} N$ dB).
Stage 3 (Signature Weighting): Each channel passes through a row of d MRRs on the TFLN photonic chip; the transmission $t_{ij} = s_{ij}$ of each MRR is electro-optically programmed via DC bias electrodes to encode the block signature weight, performing wavelength-selective multiplication $P_{\mathrm{out}}(\lambda_j) = q_j \times s_{ij}$. Stage 4 (Summation): Broadband photodetectors integrate all wavelengths, yielding photocurrents $I_i = \mathcal{R} \cdot \sum_j (q_j \cdot s_{ij})$ that are proportional to the inner product $q \cdot s_i$. Stage 5 (Top-k Selection): ADCs digitize the N photocurrents, a digital top-k selector identifies the k highest-scoring block indices, and a memory controller fetches only those KV blocks from HBM/flash storage.

The system operates as a five-stage pipeline, illustrated in fig. 2. For each retrieval head at each decode step:

1. The GPU computes the query vector $q_t$ and applies the signature projection to obtain a d-dimensional query sketch $s_q$.
2. $s_q$ is converted to the optical domain and broadcast.
3. The photonic weight bank computes N inner products in parallel.
4. Photodetectors produce N analog similarity scores.
5. A digital top-k selector identifies the best blocks, and only those blocks are fetched from KV cache storage.

The GPU then computes exact attention over the selected blocks plus the local sliding window, producing the final attention output.

III.2 Signature Encoding

The performance of Prism depends critically on the quality of the block-level signatures programmed into the MRR weight banks. Since signature encoding defines the input interface between the digital LLM pipeline and the photonic engine, we describe it first. We consider four signature construction methods.

Mean key. The simplest approach averages the key vectors within each block:

$$\sigma_n = \frac{1}{B} \sum_{i \in \mathrm{block}\,n} k_i^{(\ell,h)}. \tag{7}$$

This preserves the original key-space geometry but requires $d_h$-dimensional signatures (e.g., $d_h = 128$), demanding a correspondingly large number of MRRs per channel.

PCA projection. Principal component analysis over the key distribution yields a projection matrix $P \in \mathbb{R}^{d \times d_h}$ ($d \ll d_h$) that captures the dominant variance directions. The signature becomes $\sigma_n = P\,\bar{k}_n$, reducing the MRR count per channel from $d_h$ to d.

Random projection. The Johnson–Lindenstrauss (JL) lemma guarantees that a random Gaussian matrix $R \in \mathbb{R}^{d \times d_h}$ with $d = O(\epsilon^{-2} \log N)$ preserves pairwise distances (and hence inner-product rankings) to within a factor $1 \pm \epsilon$ with high probability [37]. The query sketch is computed identically: $s_q = R\,q_t$. Random projection is attractive because it requires no training and provides worst-case guarantees.

Learned projection. A trainable linear layer $W_{\mathrm{proj}} \in \mathbb{R}^{d \times d_h}$ is optimized end-to-end to maximize recall@k on a calibration set. This can outperform random projections when the key distribution has exploitable structure, but requires per-model training.

Balanced photodetection. The add-drop MRR configuration provides both through-port and drop-port outputs simultaneously. A balanced photodetector pair measures the differential photocurrent $I_{\mathrm{bal}} = I_{\mathrm{through}} - I_{\mathrm{drop}}$, yielding a signed weight $w_{n,j} = T_{\mathrm{through}}(\lambda_j) - T_{\mathrm{drop}}(\lambda_j) \in [-1, +1]$. On-resonance (minimum through-port transmission), $w \approx -1$; fully detuned, $w \approx +1$. This eliminates the need for split encoding or ReLU projection, enabling direct signed inner products with d MRRs per channel (half the count of split encoding) while preserving full sign information.
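The signature options above can be made concrete with a small digital stand-in. The following is a minimal NumPy sketch, not the authors' implementation: it builds mean-key signatures (eq. 7), compresses them with a Gaussian random projection, rescales them into the signed range $[-1, +1]$ that balanced detection provides, and ranks blocks by inner product. All function names, the synthetic Gaussian keys, the planted query, and the peak-normalization step are assumptions made for illustration.

```python
import numpy as np


def mean_key_signatures(keys, block_size):
    """Average the key vectors inside each block of `block_size` tokens (eq. 7)."""
    n_tokens, d_h = keys.shape
    n_blocks = n_tokens // block_size  # assumption: drop any trailing partial block
    blocks = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d_h)
    return blocks.mean(axis=1)


def random_projection(d, d_h, rng):
    """Gaussian JL-style matrix R of shape (d, d_h), scaled by 1/sqrt(d)."""
    return rng.standard_normal((d, d_h)) / np.sqrt(d)


def to_signed_weights(signatures):
    """Rescale into [-1, +1]; a hypothetical normalization, ranking is unchanged."""
    peak = np.abs(signatures).max()
    return signatures / peak if peak > 0 else signatures


def top_k_blocks(weights, query_sketch, k):
    """Digital stand-in for the photonic inner products plus top-k selection."""
    scores = weights @ query_sketch
    return np.argsort(scores)[::-1][:k]


rng = np.random.default_rng(0)
d_h, d, block_size, n_tokens = 128, 32, 128, 4096
keys = rng.standard_normal((n_tokens, d_h))      # synthetic keys, not model data

R = random_projection(d, d_h, rng)
signatures = mean_key_signatures(keys, block_size) @ R.T   # shape (N, d)
weights = to_signed_weights(signatures)

# Planted query: aligned with the mean key of block 5, so block 5 should rank highly.
query = keys[5 * block_size : 6 * block_size].mean(axis=0)
selected = top_k_blocks(weights, R @ query, k=8)
print("selected blocks:", selected)
```

Because the rescaling is a single positive factor shared by all signatures, it leaves the rank order of the scores untouched, which is the only property the top-k stage relies on.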
III.3 WDM Query Broadcast

The d-dimensional query sketch $s_q = [s_1, s_2, \ldots, s_d]$ is converted from the digital domain by d digital-to-analog converters (DACs), each driving a Mach–Zehnder modulator (MZM) [38] that impresses the value $s_i$ onto wavelength $\lambda_i$ from a WDM laser comb source. The modulated signals are multiplexed into a single waveguide carrying d wavelength-encoded values [39, 40].

The DAC resolution requirement is modest: since the task is ranking rather than exact computation, 4–6 bits of input precision suffice (Sec. IV.1). This relaxation is critical because high-resolution, high-speed DACs are a major energy cost in photonic accelerators. At 4-bit resolution, a DAC operating at 1 GSa/s consumes approximately 0.5 mW per channel.

The multiplexed d-wavelength signal is then split into N copies by a $1 \times N$ optical splitter tree. Each copy carries the full query sketch, attenuated by the splitting loss:

$$L_{\mathrm{split}} = 10 \log_{10}(N) + \alpha_{\mathrm{excess}}\,\lceil \log_2 N \rceil \quad [\mathrm{dB}], \tag{8}$$

where $\alpha_{\mathrm{excess}} \approx 0.2$ dB per stage for optimized $1 \times 2$ directional couplers. For $N = 1024$ blocks, the total splitting loss is approximately 32 dB, requiring a laser source power of 10 dBm to 20 dBm to maintain adequate signal-to-noise ratio (SNR) at the photodetectors.

To manage loss, the N channels can be organized into $N_{\mathrm{bank}}$ independent banks, each serving $N/N_{\mathrm{bank}}$ channels with a separate splitter tree. This reduces per-bank splitting loss at the cost of additional laser sources or optical amplifiers. The key point is that the broadcast is passive and costs no energy beyond the splitting loss: the same query vector reaches all N channels simultaneously, with no per-channel memory access or data movement.

III.4 MRR Weight Bank Similarity Engine

Each of the N output channels contains a linear array of d MRRs, one per wavelength channel.
The j-th MRR in channel n is electro-optically tuned so that its transmission at wavelength $\lambda_j$ encodes the signature weight $w_{n,j}$:

$$P_{\mathrm{out},n}(\lambda_j) = w_{n,j} \cdot P_{\mathrm{in}}(\lambda_j), \qquad w_{n,j} = T_{\mathrm{through},n}(\lambda_j) - T_{\mathrm{drop},n}(\lambda_j) \in [-1, +1]. \tag{9}$$

The total number of MRRs in the system is $d \times N$. For $d = 64$ and $N = 1024$, this yields $65\,536$ MRRs, a large but feasible integration scale for current photonic platforms [35].

Weight programming occurs at the block completion rate. When a new KV cache block of B tokens is completed, the corresponding column of MRR weights is updated via electro-optic (Pockels) tuning with sub-nanosecond response time. During steady-state decoding, the weight bank is static and the only dynamic signal is the broadcast query sketch. Because TFLN EO tuning is capacitive, the MRR weight bank consumes near-zero static power; only switching energy (~5 fJ per weight update) is required. This 5 fJ figure refers to the MRR electrode charging energy alone; the total switching energy including CMOS driver circuits is estimated at 50–500 fJ.

Each channel terminates in a broadband (wavelength-insensitive) photodetector [41] that integrates the optical power across all d wavelengths:

$$I_n = \mathcal{R} \sum_{j=1}^{d} \left[ T_{\mathrm{through},n}(\lambda_j) - T_{\mathrm{drop},n}(\lambda_j) \right] s_j\, P_0 = \mathcal{R} \sum_{j=1}^{d} w_{n,j}\, s_j\, P_0, \tag{10}$$

where $w_{n,j} \in [-1, +1]$. This is precisely an analog optical dot product: the photocurrent $I_n \propto \sum_{j=1}^{d} w_{n,j}\, s_j$ computes the similarity score $w_n \cdot s_q$ between stored block signature n and the broadcast query [42], with no explicit multiply-accumulate circuit. The physics of broadband photodetection inherently performs the summation, so no electronic accumulator is needed [43].

III.5 Electronic Top-k Interface

The N photocurrents are converted to digital values by N ADCs and fed to a digital top-k comparator network.
The comparator identifies the k channels with the largest similarity scores and outputs their indices. For $k \ll N$, a partial-sort network suffices, with complexity $O(N \log k)$ and latency of a few nanoseconds at a 1 GHz clock. The ADC resolution can be as low as 4–6 bits, since only the rank ordering matters.

III.6 Device Parameters

Figure 3: Prism photonic chip layout for an $8 \times 8$ configuration ($d = 8$ WDM channels, $N = 8$ signature rows). Left: the WDM query input ($\lambda_1$–$\lambda_8$) enters and is split by cascaded $1 \times 2$ Y-junctions. Center: each row contains d MRRs coupled to a bus waveguide with coupling gap of ~200–300 nm; EO DC bias electrodes program the MRR resonances to encode signature weights via the Pockels effect. Right: through-port and drop-port outputs route to balanced Ge-on-Si PD pairs (or optionally on-chip integrated photodetectors). Scale bar: 100 µm. The layout scales to $d = 32$, $N = 256$ by increasing the splitter tree depth and the number of rows.

Figure 4: X-cut TFLN rib waveguide cross-section. The rib is etched 500 nm into a 600 nm LN film on SiO2, leaving a 100 nm slab. Lateral Au electrodes apply DC bias for electro-optic (Pockels) tuning of the MRR resonance wavelength. Waveguide width: 1.4 µm.

Table 1 summarizes the assumed device parameters for the thin-film lithium niobate (TFLN) photonic platform, based on recent demonstrations of high-Q TFLN micro-ring resonators [44, 45] and MRR weight bank architectures [46, 35]. The physical chip layout for an $8 \times 8$ demonstration configuration is shown in fig. 3.

Table 1: Prism device parameters (TFLN platform).

Parameter | Value | Notes
Platform | X-cut TFLN | 600 nm LN/SiO2
Waveguide | Rib, 1.4 × 0.6 µm | 500 nm etch
MRR radius | 20 µm | FSR ≈ 8.3 nm
Q_L | ~10^4 | FDTD: 12,500
Extinction | >15 dB | Add-drop
Wt. precision | 5 bit | EO resolution
Tuning | EO (Pockels) | 28.5 pm/V
Static power | ~0 | Capacitive EO
Switch energy | ~5 fJ/ring | Per update
Tuning speed | <1 ns | EO response
Modulator | TFLN MZM | VπL ~ 2 V·cm
Photodetector | Balanced PD | Differential
WDM ch. | d = 32–128 | 1.6 nm spacing†
Laser | Comb source | ≤ 100 mW

† $d = 32$–$64$ is realistic with current C+L band technology; $d = 128$ requires C+L+S band operation and has not been experimentally demonstrated.

The total MRR count ($d \times N$; eq. 9) scales with configuration as shown in table 10. Because TFLN electro-optic tuning is capacitive, the static power consumption is near zero (Sec. VI.2).

IV Photonic Hardware Analysis

We now incorporate realistic photonic device impairments into the Prism simulation and quantify the optical link budget, noise performance, and energy–latency tradeoffs against electronic baselines.

IV.1 Device Impairment Modeling

We model six impairment sources that degrade the ideal inner-product computation of eq. 6 [47]: (i) weight quantization (4–8 bit DAC precision) [48], (ii) thermal drift of MRR resonance wavelengths ($\sigma_{\mathrm{drift}} = 0.01$ nm to 0.1 nm) [49], (iii) MRR and waveguide insertion loss, (iv) photodetector shot and thermal noise (NEP ~ 10 pW/√Hz), (v) inter-channel MRR crosstalk (−15 dB to −30 dB isolation), and (vi) input DAC quantization noise. Table 3 summarizes the parameter ranges used in the hardware simulation. Full impairment models are provided in Supplementary Section S1.

IV.2 Optical Link Budget

A critical question for any photonic accelerator is whether sufficient optical signal-to-noise ratio (SNR) can be maintained across the complete optical path [32]. Table 2 traces the optical power from the laser source to each photodetector for a representative configuration ($d = 32$, $N = 256$).

Table 2: Optical link budget for $d = 32$, $N = 256$ ($P_{\mathrm{laser}} = 20$ dBm).
| Element | Loss (dB) | Cumulative (dBm) |
|---|---|---|
| Laser output | n/a | +20.0 |
| Fiber–chip coupling | −2.0 | +18.0 |
| MZM modulator (avg.) | −3.0 | +15.0 |
| $1 \times 256$ splitter | −25.7 (a) | −10.7 |
| Waveguide (2 cm) | −1.0 | −11.7 |
| $d=32$ MRR (balanced, worst-case drop) | −3.2 (b) | −14.9 |
| Chip–detector coupling | −1.0 | −15.9 |
| Per-PD optical power | | −15.9 dBm |

(a) $10 \log_{10}(256) + 0.2 \times 8 = 24.1 + 1.6 = 25.7$ dB.
(b) Drop-port 0.1 dB + 31 × 0.05 dB through-port = 1.65 dB; budgeted as 3.2 dB including alignment margin [47].

Balanced detection link budget. The link budget in table 2 traces the drop-port path to the target photodetector. In the balanced configuration used by Prism, each MRR channel requires two optical paths (through-port and drop-port), each terminated by a separate photodetector and TIA. The through-port path sees lower loss (no drop-port penalty), so the drop-port budget above represents the worst case. Consequently, balanced detection doubles the photodetector and TIA count to $2N$ per wavelength channel; this overhead is reflected in table 4.

At $P_{PD} = -15.9$ dBm ≈ 25.7 µW per detector, the resulting photocurrent is $I_{ph} = \mathcal{R} \cdot P_{PD} = 1.0\ \text{A/W} \times 25.7\ \mu\text{W} = 25.7\ \mu\text{A}$. The electrical SNR at the detector is

$$\mathrm{SNR} = \frac{I_{ph}^2}{2 e I_{ph} \Delta f + 4 k_B T \Delta f / R_L + (\mathcal{R} \cdot \mathrm{NEP})^2 \Delta f}, \tag{11}$$

where $\Delta f \approx 1$ GHz (matching the query update rate) and NEP = 10 pW/√Hz (eq. S5). For $R_L = 1$ kΩ and $T = 300$ K, we obtain SNR ≈ 37.2 dB, well above the minimum required for reliable rank ordering [50]. (Note: $R_L = 1$ kΩ assumes a transimpedance amplifier (TIA) front-end rather than 50 Ω termination.)

For larger bank sizes ($N=1024$), the additional 6 dB splitting loss reduces the per-detector power to −21.9 dBm ≈ 6.5 µW, yielding SNR ≈ 25.5 dB. This remains adequate for top-k ranking, as verified by the recall analysis in Sec. IV.3.
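The link-budget arithmetic of table 2 and the SNR expression of eq. 11 can be checked numerically. The following is a minimal sketch, not the authors' simulation code: the function names are hypothetical, and the default parameter values are simply the ones quoted in the text above (responsivity 1.0 A/W, 1 GHz bandwidth, 1 kΩ load, 300 K, NEP of 10 pW/√Hz).

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge (C)
K_BOLTZ = 1.380649e-23       # Boltzmann constant (J/K)


def splitter_loss_db(n_outputs, excess_per_stage_db=0.2):
    """Ideal 1xN split plus excess loss per 1x2 stage (table 2, note a)."""
    stages = math.ceil(math.log2(n_outputs))
    return 10 * math.log10(n_outputs) + excess_per_stage_db * stages


def per_pd_power_dbm(laser_dbm=20.0, n_rows=256):
    """Subtract the table 2 losses along the worst-case drop-port path."""
    # fiber-chip coupling, MZM, waveguide, MRR bank, chip-detector coupling
    fixed_losses_db = 2.0 + 3.0 + 1.0 + 3.2 + 1.0
    return laser_dbm - fixed_losses_db - splitter_loss_db(n_rows)


def snr_db(p_pd_dbm, responsivity=1.0, bandwidth_hz=1e9,
           load_ohm=1e3, temp_k=300.0, nep_w_per_rthz=10e-12):
    """Electrical SNR of eq. 11 for a given per-detector optical power."""
    p_watt = 1e-3 * 10 ** (p_pd_dbm / 10)
    i_ph = responsivity * p_watt
    shot = 2 * E_CHARGE * i_ph * bandwidth_hz
    thermal = 4 * K_BOLTZ * temp_k * bandwidth_hz / load_ohm
    nep = (responsivity * nep_w_per_rthz) ** 2 * bandwidth_hz
    return 10 * math.log10(i_ph ** 2 / (shot + thermal + nep))


if __name__ == "__main__":
    print(f"per-PD power: {per_pd_power_dbm():.1f} dBm")
    for p_dbm in (-15.9, -21.9, -27.9):
        print(f"{p_dbm} dBm -> SNR {snr_db(p_dbm):.1f} dB")
```

With these defaults the NEP term is the largest noise contribution at all three power levels, so the SNR falls by roughly 12 dB for every 6 dB of additional splitting loss.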
Beyond $N=4096$ (SNR ≈ 13.5 dB), the link budget requires either a higher-power laser ($P_{laser} > 26$ dBm) or the banked splitter architecture described in Sec. I.3. Figure 5 illustrates the per-detector received power and SNR as a function of the bank size $N$, clearly showing the crossover point at which banked architectures or optical amplification become necessary.

Figure 5: Optical power budget analysis. (a) Per-detector received power vs. bank size $N$ for three laser powers. The horizontal dashed line indicates the minimum detectable power (−20 dBm). (b) Electrical SNR at the photodetector vs. signature dimension $d$ for $N=256$ and $N=1024$. The shaded region marks SNR > 20 dB, sufficient for reliable top-k ranking.

Figure 6: MRR weight encoding principle. (a) Through-port and drop-port transmission of a single add-drop MRR ($Q_L = 10{,}000$, ER = 20 dB). The balanced weight $w = T_{through} - T_{drop}$ maps from −1 (on-resonance) to +1 (fully detuned). (b) Weight-to-balanced-transmission mapping for different DAC precisions.

Table 3: Device impairment parameter ranges used in hardware simulation.

| Impairment | Parameter | Range |
|---|---|---|
| Weight quantization | $b$ (bits) | 4–8 |
| Thermal drift | $\sigma_{drift}$ (pm) | 10–100 |
| MRR insertion loss | $\mathrm{IL}_{MRR}$ (dB) | 0.02–0.05 (c) |
| Splitter excess loss | $\alpha_{excess}$ (dB/stage) | 0.1–0.3 |
| Detector NEP | (pW/√Hz) | 1–20 |
| MRR crosstalk | Isolation (dB) | −15 to −30 |
| DAC resolution | $b_{DAC}$ (bits) | 4–8 |

(c) Through-port IL per non-target MRR; drop-port (target MRR) IL is ∼0.1 dB.

IV.3 Recall Degradation Analysis

We inject impairments into the inner-product computation and measure recall@k degradation relative to the ideal (floating-point) baseline. Individual impairment sweeps (quantization precision, thermal drift, weight fidelity, and detector noise) are presented in Supplementary Figs. S1–S4.

Figure 7: Digital vs. MRR photonic inner-product comparison ($d=32$, $N=256$, $K=8$).
(a) Score correlation between exact (FP64) and MRR-computed similarity for three hardware configurations. Pearson correlation $\rho > 0.98$ for all configs. (b) Rank agreement for the nominal config (5-bit, 20 pm): green triangles indicate correctly identified top-K blocks (7/8 match, Recall@8 = 88%). (c) Normalised score error distributions; pessimistic config (4-bit, 30 pm) shows wider tails but remains zero-centred.

Combined impairments. We simulate the full impairment chain (quantization + drift + loss + noise + crosstalk) using a Monte Carlo approach with 100 trials of 500 blocks ($d=32$). Figure 7 visualises the effect for a single trial: the MRR scores correlate strongly with the digital baseline ($\rho > 0.98$), and the top-k ranking is largely preserved. Figure 8 maps the recall degradation as a function of both weight precision and thermal drift magnitude, identifying the operating region in which Recall@8 exceeds 80%. The combined recall degradation at $b=6$, $\sigma_{th} = 0.01$, and $\sigma_{det} = 0.01$ is approximately 8%, yielding an effective Recall@8 of $0.916 \pm 0.087$ (vs. 1.000 ideal). Each impairment source individually contributes modestly (5-bit quantization: 0.904, drift: 0.948, noise: 0.928), but their combination remains above the 90% threshold required for effective block selection.

Figure 8: Combined impairment sensitivity: Recall@8 as a function of weight precision (bits) and thermal drift $\sigma$ ($d=32$, $N=500$, 50 Monte Carlo trials per cell). Markers indicate the three operating points studied in this work: nominal (5-bit, $\sigma = 0.01$), optimistic (6-bit, $\sigma = 0.005$), and pessimistic (4-bit, $\sigma = 0.02$). Recall exceeds 80% for ≥5-bit precision and $\sigma \le 0.02$.

The recall degradation results establish the acceptable operating region for the MRR weight bank.
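The structure of such a Monte Carlo recall study can be illustrated with a small standard-library sketch. It is not the authors' simulator and does not reproduce the recall figures above: the uniform random signatures, the additive Gaussian weight perturbation standing in for thermal drift, and the additive detector noise on the raw score are all simplifying assumptions, and every function name is hypothetical.

```python
import random


def quantize(w, bits):
    """Round a weight in [-1, 1] to a uniform grid of 2**bits levels."""
    if bits is None:          # None means "no quantization" (ideal reference)
        return w
    levels = 2 ** bits - 1
    return round((w + 1) / 2 * levels) / levels * 2 - 1


def top_k(scores, k):
    """Indices of the k largest scores, as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def recall_trial(rng, n_blocks=500, d=32, k=8, bits=5,
                 weight_sigma=0.01, det_sigma=0.01):
    """One trial: exact vs. impaired inner products, then recall@k."""
    query = [rng.uniform(-1, 1) for _ in range(d)]
    exact, impaired = [], []
    for _ in range(n_blocks):
        sig = [rng.uniform(-1, 1) for _ in range(d)]
        exact.append(sum(q * s for q, s in zip(query, sig)))
        # Assumed impairment model: quantized weight plus Gaussian error.
        noisy = [quantize(s, bits) + rng.gauss(0, weight_sigma) for s in sig]
        score = sum(q * s for q, s in zip(query, noisy))
        impaired.append(score + rng.gauss(0, det_sigma))
    return len(top_k(exact, k) & top_k(impaired, k)) / k


def mean_recall(trials=20, seed=0, **kwargs):
    """Average recall@k over independent trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(recall_trial(rng, **kwargs) for _ in range(trials)) / trials


if __name__ == "__main__":
    for bits in (4, 5, 6):
        print(bits, "bits ->", round(mean_recall(bits=bits), 3))
```

Sweeping `bits` and `weight_sigma` over a grid gives a map of the same kind as fig. 8, although the absolute values depend entirely on the assumed signature statistics and noise model.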
End-to-end NIAH validation with MRR-simulated block selection, confirming that these impairments do not degrade downstream task accuracy, is presented in Sec. V.3.

IV.4 Energy Model

Table 4 breaks down the energy per query evaluation for the Prism system. We define the energy metric as energy per inner-product evaluation (i.e., per block scored per query per head).

Table 4: Prism energy breakdown per query ($d=64$, $N=1024$, $k=32$, TFLN platform).

| Component | Power (mW) | Energy/query (pJ) |
|---|---|---|
| Laser source | 20.0 | 180 |
| TEC (thermal stab.) | 1000 | 9000† |
| Voltage driver array | 5.0 | 45 |
| DACs ($d$ channels) | 32.0 | 288 |
| MZM modulators | 6.4 | 58 |
| EO bias (static) | ∼0 | ∼0∗ |
| Photodetectors ($2N$, balanced) | 10.0 | 90 |
| TIAs + ADCs ($2N$, balanced) | 100.0 | 900 |
| Top-k logic | 1.0 | 9 |
| Dynamic subtotal | 174.4 | 1570 |
| Total (incl. TEC) | 1174.4 | 10570 |

∗ TFLN EO tuning is capacitive; switching energy ∼5 fJ/ring.
† TEC power is amortized across all heads and queries; per-query share ≪ 1 nJ at realistic throughput.

A key advantage of the TFLN platform is the elimination of static MRR tuning power. TFLN electro-optic tuning via the Pockels effect is capacitive and consumes near-zero static power (see Sec. VI.2 for the quantitative SOI comparison). The only energy cost per weight update is the switching energy of ∼5 fJ per ring, which is negligible compared to the dynamic optical and electronic components [39]. Note that while the total system power (∼1.17 W) is dominated by the TEC, this is a fixed overhead shared across all heads and queries; at typical decode throughput (>1000 tokens/s), the amortized TEC contribution per query is <1 µJ, still well below the GPU baseline.

For comparison, the H100 GPU full-scan baseline reads every KV block signature once per query per head.
The energy per selection is

$$E_{scan} = 2\, d_h\, N\, b_{prec}\, E_{byte} = 2 \times 128 \times 1024 \times 2 \times 31\ \text{pJ/B} \approx 16.3\ \mu\text{J}, \tag{12}$$

where $d_h = 128$ is the head dimension, $N = n/B = 1024$ blocks at 128K context ($B=128$), $b_{prec} = 2$ B (bf16), and $E_{byte} \approx 31$ pJ/B (≈ 3.9 pJ/bit, standard HBM3 specification) [51]. Note that this baseline assumes the GPU scans the full key dimension $d_h = 128$; if the GPU instead scans compressed $d=32$ signatures, the energy reduces to ∼4.1 µJ (4× lower). Even in this fairer comparison, Prism's ∼1570 pJ selection energy remains over three orders of magnitude below the GPU scan [43]. GPU ANN (FAISS IVF-PQ) reduces the full-key scan to ∼5 µJ by scanning $O(\sqrt{N})$ centroids. NVIDIA ICMS consumes ∼10 µJ, estimated by replacing $BW_{HBM}$ with the DPU's internal LPDDR5 bandwidth (∼100 GB/s) and assuming a similar scan pattern over the flash-backed KV index.

IV.5 Latency Model

The Prism latency is the sum of the five pipeline stages:

$$t_{PRISM} = t_{DAC} + t_{opt} + t_{PD} + t_{ADC} + t_{top\text{-}k}, \tag{13}$$

where the optical propagation time $t_{opt}$ includes the modulator response, waveguide transit, and MRR ring-down time.

Table 5: Prism latency breakdown.

| Stage | Latency | Notes |
|---|---|---|
| DAC | ∼1 ns | 4-bit |
| MZM | ∼0.1 ns | Si depl. |
| Opt. prop. | ∼0.5 ns | 5 cm |
| MRR decay | ∼0.1 ns | $Q = 10^4$ |
| PD | ∼0.2 ns | Ge |
| TIA+ADC | ∼2 ns | 6-bit flash |
| Top-k | ∼5 ns | CMOS |
| Total | ∼9 ns | |

The total Prism latency of ∼9 ns compares favorably with the electronic baselines: GPU full scan ∼5 µs, GPU ANN ∼1 µs, and NVIDIA ICMS ∼0.5 µs, representing a ∼500× speedup over full scan. However, this comparison must account for the additional latency of fetching the selected KV blocks from memory after Prism selection, which adds 0.5 µs to 2 µs depending on the memory tier (HBM vs. flash).
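The energy and latency figures above follow from simple arithmetic, sketched below under stated assumptions. The function and dictionary names are hypothetical. The relation "energy per query = power × ∼9 ns" is inferred from the entries of table 4 (for example 20 mW × 9 ns = 180 pJ) and is not stated explicitly in the text.

```python
# Per-stage latencies from table 5, in nanoseconds.
STAGE_LATENCY_NS = {
    "DAC": 1.0,
    "MZM": 0.1,
    "optical propagation": 0.5,
    "MRR decay": 0.1,
    "PD": 0.2,
    "TIA+ADC": 2.0,
    "top-k": 5.0,
}


def gpu_scan_energy_pj(dim=128, n_blocks=1024, bytes_per_value=2,
                       pj_per_byte=31.0):
    """Eq. 12: 2 * d_h * N * b_prec * E_byte, in picojoules."""
    return 2 * dim * n_blocks * bytes_per_value * pj_per_byte


def prism_latency_ns():
    """Sum of the table 5 stage latencies (about 9 ns)."""
    return sum(STAGE_LATENCY_NS.values())


def energy_per_query_pj(power_mw, latency_ns=9.0):
    """Assumed relation behind table 4: mW * ns = pJ."""
    return power_mw * latency_ns


if __name__ == "__main__":
    full = gpu_scan_energy_pj()            # full key dimension, d_h = 128
    compressed = gpu_scan_energy_pj(dim=32)  # compressed d = 32 signatures
    prism = energy_per_query_pj(174.4)     # dynamic subtotal of table 4
    print(f"GPU full scan:   {full / 1e6:.2f} uJ")
    print(f"GPU d=32 scan:   {compressed / 1e6:.2f} uJ")
    print(f"Prism dynamic:   {prism:.0f} pJ")
    print(f"ratio (d=32):    {compressed / prism:.0f}x")
    print(f"Prism latency:   {prism_latency_ns():.1f} ns")
```

The ratio printed for the compressed-signature case is the "over three orders of magnitude" margin quoted above.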
The net latency benefit of Prism is therefore most pronounced when the selection ratio $k/N$ is small and the KV cache resides in a slow memory tier (e.g., flash in ICMS). The crossover analysis quantifying these trade-offs across context lengths and baselines is presented in Sec. VI.5. Figure 9 summarises the interplay between signature dimension, photodetector SNR, and ranking accuracy across the operating envelope of Prism.

Figure 9: SNR and recall analysis. (a) Electrical SNR at the photodetector as a function of signature dimension $d$ for three bank sizes. (b) Recall@8 vs. SNR showing that reliable top-k selection (>90% recall) requires SNR ≳ 15 dB.

Balanced photodetection noise. In the balanced configuration, each channel uses two photodetectors measuring through-port and drop-port signals independently [46]. Shot noise from both PDs adds in quadrature: $\sigma_I^2 = 2e(I_{through} + I_{drop})\Delta f$. Since $I_{through} + I_{drop} = \mathcal{R} P_0$ (power conservation), the total shot noise is weight-independent, simplifying the noise analysis. The factor of $\sqrt{2}$ increase in noise is offset by the doubled signal dynamic range ($[-1, +1]$ vs. $[0, 1]$).

V System-Level Evaluation

This section evaluates the complete Prism pipeline from algorithmic profiling through end-to-end validation. We first profile retrieval heads across two LLM families (Sec. V.1), then evaluate block signature design and recall (Sec. V.2), and validate downstream accuracy via Needle-in-a-Haystack experiments with MRR-simulated block selection (Sec. V.3).

V.1 Retrieval-Head Analysis

Models and datasets. We profile two representative open-weight LLMs: Qwen2.5-7B-Instruct [52] ($L=28$, $H=28$, $d_h=128$, GQA [28] with $H_{KV}=4$; total 112 KV heads) and Qwen3-8B [53] ($L=36$, $H=32$, $d_h=128$, GQA with $H_{KV}=8$; total 288 KV heads). Qwen2.5-7B supports context lengths of at least 128 000 tokens; Qwen3-8B supports up to 32 000 tokens.
We compute retrieval ratios $R_h^{(\ell,h)}$ (eq. 5) on a calibration set of 2–3 random token sequences per context length, with $w=256$ as the local window size. The retrieval ratio is measured using a two-step procedure: SDPA-based prefill followed by eager last-token attention extraction. All experiments are run on an NVIDIA RTX 5880 (48 GB VRAM) for bf16 models, and an NVIDIA RTX 5070 (12 GB) for 4-bit quantized variants. We additionally verify consistency between bf16 and 4-bit quantized Qwen2.5-7B, finding that quantization does not substantially alter retrieval head identification (e.g., 91.1% vs. 92.0% at 8K context for bf16 and 4-bit, respectively).

Results. Figure 10 shows the retrieval ratio heatmap across all layers and heads for both models.

Figure 10: Retrieval ratio $R_h^{(\ell,h)}$ for each KV head across all layers. Heads with $R_h > 0.3$ (dashed line) are classified as retrieval heads. (a) Qwen2.5-7B: 102/112 heads are retrieval heads (91.1%) at 8K context. (b) Qwen3-8B: 258/288 heads are retrieval heads (89.6%) at 8K context.

Table 6 summarizes the retrieval head fraction as a function of context length for both models.

Table 6: Retrieval head fraction at threshold $\tau = 0.3$ across context lengths. $R_h(\tau)$: percentage of KV heads with $R_h > \tau$. Mean $\bar{R}_h$: average retrieval ratio across all heads.

| Model | Context | $R_h(\tau = 0.3)$ (%) | Heads | Mean $\bar{R}_h$ |
|---|---|---|---|---|
| Qwen2.5-7B | 2K | 83.9 | 94/112 | 0.574 |
| | 4K | 83.0 | 93/112 | 0.560 |
| | 8K | 91.1 | 102/112 | 0.627 |
| | 16K | 92.9 | 104/112 | 0.639 |
| | 32K | 95.5 | 107/112 | 0.656 |
| | 65K | 92.0 | 103/112 | 0.633 |
| | 128K | 98.2 | 110/112 | 0.796 |
| Qwen3-8B | 2K | 86.5 | 250/288 | 0.626 |
| | 4K | 88.2 | 254/288 | 0.652 |
| | 8K | 89.6 | 258/288 | 0.657 |

Figure 11 visualizes the retrieval head fraction and mean retrieval ratio as a function of context length.

Figure 11: Retrieval head statistics vs. context length. (a) Retrieval head fraction $R_h(\tau = 0.3)$ for Qwen2.5-7B (bf16 and 4-bit) and Qwen3-8B (bf16). The fraction exceeds 90% for $n \ge 8$K and approaches 99% at 256K context.
(b) Mean retrieval ratio $\bar{R}_h$ for Qwen2.5-7B (bf16), showing that individual-head retrieval strength also increases with context length.

We observe the following patterns:

• Ubiquity of retrieval behavior. At a threshold of $\tau = 0.3$, 91.1% of KV heads in Qwen2.5-7B and 89.6% in Qwen3-8B are retrieval heads at 8K context. This prevalence increases with context length: for Qwen2.5-7B, the fraction rises from 83.9% at 2K to 98.2% at 128K context, indicating that nearly all heads engage in long-range retrieval at long contexts. Note that at the more permissive $\tau = 0.1$ threshold used in [16], essentially 100% of heads qualify as retrieval heads. The reported fraction is thus sensitive to the threshold choice: varying $\tau$ from 0.1 to 0.3 shifts the classified fraction from ∼100% to ∼90%. The contrast with DuoAttention's 25–50% retrieval fraction reflects both (i) different models (Llama-2/Mistral vs. Qwen) and (ii) DuoAttention's use of a learned gating function optimized on calibration data, which imposes a stricter criterion than a simple threshold on attention mass. In practice, the threshold can be tuned per deployment scenario to trade off between the number of heads served photonically and the complexity of the photonic accelerator.

• Layer distribution. The highest-scoring retrieval heads are concentrated in layers 14–26, with peak retrieval ratios exceeding 0.93.

• GQA effect. Because GQA shares KV heads across multiple query heads, the number of KV cache entries requiring retrieval-style treatment is even smaller than the head count suggests.

The key implication for Prism is that the photonic accelerator needs to serve the vast majority of KV heads: 102 out of 112 for Qwen2.5-7B and 258 out of 288 for Qwen3-8B at 8K context. However, GQA sharing means each KV head serves multiple query heads, so the number of independent weight bank instances required equals the KV head count, not the query head count.
V.2 Block Signature Design

We partition the KV cache into contiguous blocks of $B$ tokens and compute a $d$-dimensional signature for each block [54]. We evaluate mean-key and random projection signature methods from Sec. I.2 at block size $B=128$ and signature dimensions $d \in \{16, 32, 64, 128\}$, using Qwen2.5-7B at context length $n=4096$.

Our experiments identify $B=128$ with $d=32$ and mean-key projection as the best configuration. At the primary operating point $k=32$, table 7 shows R@32 = 100% at 8K context ($B=128$, 64 blocks), confirming that the signature ranking correctly identifies all relevant blocks. At 16K, R@32 drops to 57.5%, yet downstream NIAH accuracy remains 100% (table 9), indicating that task-critical blocks are consistently ranked in the top-k even when overall recall is imperfect. As a stress-test analysis at $k=8$, R@8 = 77.3% (R@2 = 31.3%, R@4 = 50.0%), confirming that useful ranking signal persists even under aggressively small selection budgets. Mean-key projection consistently outperforms random projection across all tested dimensions, confirming that the natural key-space geometry contains exploitable structure for block ranking.

Why mean-key and random projections? We focus on mean-key and random projection signatures because they are model-agnostic and require no training, matching our goal of a general-purpose photonic hardware interface. Learned projections (e.g., trained linear maps optimized for recall) could improve signature quality but would require per-model fine-tuning and hardware-aware training, which we leave to future work (Sec. VII.3).

Figure 12: Recall@k as a function of signature dimension $d$ for different signature methods. Block size $B=128$, $k=8$ (stress-test setting). Mean-key projection consistently outperforms random projection, achieving 77.3% recall@8 at $d=32$. At the primary operating point $k=32$, recall reaches 100% at 8K context (table 7).

Signed weight encoding.
The add-drop MRR configuration with balanced photodetection enables direct encoding of signed weights $w \in [-1, +1]$, eliminating the non-negative constraint of through-port-only architectures. Compared to ReLU projection (which discards sign information, losing ∼50% of the signature variance for zero-mean Gaussian signatures), balanced photodetection preserves the full signed inner product. Our simulations show that signed encoding improves Recall@8 by ∼87% relative to ReLU projection at $d=32$ (Supplementary fig. S5).

Recall metric. We define recall@k [55] as the fraction of the true top-k blocks (by exact query–key inner product) that appear in the Prism-selected top-k blocks:

$$\mathrm{Recall@}k = \frac{|\mathcal{S}_{PRISM} \cap \mathcal{S}_{exact}|}{k}, \tag{14}$$

where $\mathcal{S}_{PRISM}$ and $\mathcal{S}_{exact}$ are the sets of top-k block indices selected by Prism and by exact computation, respectively.

Table 7: Recall@k for Prism block selection across context lengths. Qwen2.5-7B, $B=128$, $d=32$, mean-key projection. Values averaged over 15 (layer, head) pairs.

| n | Blocks | R@8 (%) | R@16 (%) | R@32 (%) | NIAH (%) |
|---|---|---|---|---|---|
| 4K | 16 | 46.7 | 100 | n/a∗ | 100 |
| 8K | 32 | 29.2 | 55.8 | 100 | 100 |
| 16K | 64 | 26.7 | 41.7 | 57.5 | 100 |
| 32K | 128 | (OOM†) | | | 100 |
| 64K | 256 | n/a | | | 100 |

∗ Only 16 blocks at 4K; $k=32$ exceeds total.
† Eager attention OOM at 32K; NIAH uses SDPA (no attention matrix).

Traffic reduction. At the primary operating point $k=32$, the traffic ratio is $kB/n = 32 \times 128 / n$. At 128K tokens ($N=1024$ blocks), PRISM selects $k=32$ of $N=1024$ blocks, yielding an $N/k = 1024/32 = 32\times$ traffic reduction (3.1% traffic). At 1M tokens ($N \approx 7812$ blocks), the reduction grows to $N/k \approx 7812/32 \approx 244\times$ (0.41% traffic), though model accuracy at such lengths remains model-dependent. Under the stress-test setting $k=8$, the reduction reaches $128\times$ at 128K and projects to ∼977× at 1M tokens (see Supplementary fig. S6).
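The recall metric of eq. 14 and the traffic-reduction arithmetic can be written down directly. The sketch below uses hypothetical function names and toy scores; it illustrates the definitions only and is not taken from the authors' evaluation code.

```python
def top_k_indices(scores, k):
    """Set of indices of the k highest-scoring blocks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def recall_at_k(exact_scores, prism_scores, k):
    """Eq. 14: |S_PRISM intersect S_exact| / k."""
    exact_top = top_k_indices(exact_scores, k)
    prism_top = top_k_indices(prism_scores, k)
    return len(exact_top & prism_top) / k


def traffic_reduction(n_tokens, block_size=128, k=32):
    """N / k, with N = floor(n / B) blocks in the KV cache."""
    n_blocks = n_tokens // block_size
    return n_blocks / k


if __name__ == "__main__":
    # Toy example: the two rankings agree on one of the top-2 blocks.
    exact = [0.9, 0.1, 0.8, 0.3]
    approx = [0.7, 0.2, 0.1, 0.6]
    print("recall@2:", recall_at_k(exact, approx, 2))

    for n in (128 * 1024, 1_000_000):
        for k in (32, 8):
            print(f"n={n}, k={k}: {traffic_reduction(n, k=k):.1f}x")
```

Because only the set overlap matters, any monotone rescaling of the Prism scores leaves the recall unchanged, which is why the metric tolerates the gain and offset errors of an analog readout.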
V.3 NIAH Accuracy Under Hardware Impairments

To validate that the MRR-impaired block selection preserves end-to-end language model performance, we integrate the MRR array simulator into Qwen2.5-7B [52] and evaluate on the Needle-in-a-Haystack (NIAH) benchmark [15, 56]. For each decode step, block signatures (mean-key, $d=32$) are processed through the MRR simulator to select the top-k blocks. Retrieval heads ($R_h > 0.3$; table 6) use MRR-selected blocks plus a 256-token recent window; streaming heads retain full attention. We test four MRR configurations: (i) ideal (floating-point inner product), (ii) 5-bit/20 pm drift (nominal), (iii) 4-bit/30 pm drift (pessimistic), and (iv) 5-bit/10 pm drift (optimistic).

Table 8: NIAH accuracy (%) with MRR-integrated block selection (Qwen2.5-7B, 11 positions, $k=8$ stress-test setting).

| Configuration | 2K | 4K | 8K |
|---|---|---|---|
| Full attention | 90.9 | 100.0 | 100.0 |
| Ideal select | 90.9 | 100.0 | 100.0 |
| 5-bit, 20 pm | 90.9 | 100.0 | 100.0 |
| 4-bit, 30 pm | 90.9 | 100.0 | 100.0 |
| 5-bit, 10 pm | 90.9 | 100.0 | 100.0 |

Table 8 shows that all four MRR configurations, including the worst-case 4-bit quantization with 30 pm thermal drift, achieve identical NIAH accuracy to full attention at all tested context lengths. The single miss at 2K context (position 50%) is a model-level artifact unrelated to block selection. These results demonstrate that the MRR impairments modelled in Sec. IV.1 do not degrade downstream task accuracy for the block-selection ranking task.

To validate Prism across a wide range of context lengths, we extend the evaluation using SDPA-based attention (Flash Attention) with KV cache offloading to CPU RAM via OffloadedCache. This enables experiments at context lengths from 4K to 128K on a single GPU (RTX 5880, 48 GB) with 128 GB system RAM. We note that Qwen2.5-7B's native context window is 128K tokens; at 128K, the base model's own accuracy degrades to 45.5% on NIAH (table 9), limiting meaningful evaluation beyond 64K.
Extrapolation to longer contexts (e.g., 1M tokens via YaRN [6] rope scaling) is technically feasible for the photonic hardware, but model-level accuracy at such lengths remains an open challenge independent of block selection.

For sparse evaluation, we employ physical token selection: rather than re-attending to all tokens with a mask, only the tokens from the top-k selected blocks and a recent window are assembled into a compact input (∼5K tokens), preserving positional encoding via explicit position_ids. This approach mirrors the actual deployment scenario where only selected KV blocks are fetched from memory.

Table 9 presents the extended NIAH results across context lengths from 4K to 128K. The full 2D NIAH heatmap (context length × needle depth, 10 positions per context) is shown in fig. 13. At $k=32$ blocks ($B=128$), all MRR configurations achieve 100% accuracy from 4K through 64K, perfectly matching full attention. At 128K, the base model itself degrades to 45.5%, a known limitation of Qwen2.5-7B's context window, making sparse-vs-full comparison uninformative at this length. Within the model's reliable operating range ($n \le 64$K), MRR block selection introduces zero accuracy penalty while reducing KV memory traffic by $16\times$ at 64K ($k \cdot B / n = 32 \times 128 / 65536 = 6.25$%); the reduction grows to $32\times$ (3.1%) at 128K.

Table 9: Extended NIAH accuracy (%) with MRR block selection (Qwen2.5-7B, $B=128$, $d=32$, $k=32$, 10 positions per context). At 128K the base model degrades (†).

| Configuration | 4K | 8K | 16K | 32K | 64K | 128K† |
|---|---|---|---|---|---|---|
| Full attention | 100 | 100 | 100 | 100 | 100 | 45.5 |
| Ideal ($k=32$) | 100 | 100 | 100 | 100 | 100 | 18.2‡ |
| 5-bit, 20 pm | 100 | 100 | 100 | 100 | 100 | 27.3‡ |
| 4-bit, 30 pm | 100 | 100 | 100 | 100 | 100 | 27.3‡ |
| 5-bit, 10 pm | 100 | 100 | 100 | 100 | 100 | 27.3‡ |

‡ At 128K, full attention itself degrades to 45.5%; the apparent superiority of impaired configurations over ideal is within the ±9.1% sampling noise of 11 needle positions and is not statistically significant.
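The physical token selection step described above amounts to collecting the original positions of the selected blocks and the recent window, then passing them on as explicit position ids. A minimal sketch follows; the function name is hypothetical, the 256-token window and $B=128$ defaults are taken from Sec. V.3, and the way the resulting list is fed to the model is left out because it depends on the inference framework.

```python
def select_positions(top_blocks, n_tokens, block_size=128, recent_window=256):
    """Return the sorted original positions kept for sparse attention.

    top_blocks: indices of the k blocks chosen by the block selector.
    The recent window always covers the last `recent_window` tokens.
    """
    keep = set(range(max(0, n_tokens - recent_window), n_tokens))
    for block in top_blocks:
        start = block * block_size
        keep.update(range(start, min(start + block_size, n_tokens)))
    # Sorted order preserves causality; the values double as position ids.
    return sorted(keep)


if __name__ == "__main__":
    positions = select_positions(top_blocks=[0, 3], n_tokens=1024)
    print(len(positions), positions[0], positions[-1])
```

Blocks that overlap the recent window are deduplicated by the set, so the compact input never contains the same token twice.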
Figure 13: NIAH accuracy heatmap across context lengths (4K–128K) and MRR configurations. At $k=32$, all MRR variants match full attention perfectly up to 64K. Under the $k=8$ stress test, accuracy degrades gracefully with context length but remains above 90% at 16K. The 128K column shows model-intrinsic degradation (full attention itself drops to 45.5%).

VI Photonic Scaling Analysis

We now analyze how the photonic engine scales to larger systems, identifying constraints from WDM channel density, thermal power, chip area, and time-multiplexed operation.

VI.1 MRR Integration Scaling

The total MRR count in the Prism weight bank is

$$N_{MRR} = d \times N, \tag{15}$$

where $d$ is the number of WDM wavelength channels (signature dimension) and $N$ is the number of parallel signature banks (one per KV cache block). For a context length of $n$ tokens with block size $B$, $N = n/B$. Table 10 lists representative configurations spanning three orders of magnitude in MRR count.

Table 10: MRR count for representative Prism configurations. The Context column indicates the approximate context length supported at block size $B=128$.

| $d$ | $N$ | $N_{MRR}$ | Context | Feasibility |
|---|---|---|---|---|
| 32 | 256 | 8192 | 32K | Current TFLN |
| 64 | 1024 | 65 536 | 128K | Near-term |
| 128 | 4096 | 524 288 | 512K | Multi-chip |

Current photonic integration supports $10^4$–$10^5$ active devices per die [35, 57], placing the $d=32$, $N=256$ configuration within demonstrated capability and $d=64$, $N=1024$ at the near-term frontier. The $d=128$, $N=4096$ configuration exceeds single-chip density, requiring chiplet-based multi-chip modules [58] (Sec. VI.3).

VI.2 Thermal Power Budget and WDM Channel Limits

On thermo-optic SOI platforms [49], each MRR requires ∼2.5 mW of static heater power, yielding aggregate budgets of 20 W ($d=32$, $N=256$) to 164 W ($d=64$, $N=1024$), approaching the ∼200 W practical limit with active cooling.
On the TFLN platform, MRR tuning via the Pockels effect ($r_{33} = 30.9$ pm/V) is capacitive with near-zero static power (<1 µW per ring from CMOS driver leakage):

$$P_{static}^{TFLN} = d \times N \times P_{leakage} < d \times N \times 1\ \mu\text{W}. \tag{16}$$

For the $d=64$, $N=1024$ configuration, $P_{static} < 0.07$ W, a ∼2400× reduction over SOI. The switching energy (∼5 fJ per ring) yields <0.3 µW total switching power at typical decode rates, which is negligible. Residual thermal stabilization via TEC (∼1 W for a ∼1 cm² chip) remains necessary but is orders of magnitude below SOI heater budgets. TFLN's lower thermo-optic coefficient ($dn/dT \approx 4 \times 10^{-5}\ \text{K}^{-1}$ vs. $1.8 \times 10^{-4}\ \text{K}^{-1}$ for Si) further reduces thermal crosstalk.

WDM channel limits. The signature dimension $d$ is constrained by the MRR free spectral range (FSR) [59]. A single-FSR MRR ($R = 20$ µm, FSR ≈ 8.3 nm) supports only ∼5 channels at 200 GHz spacing. Vernier-coupled dual-ring filters extend the effective FSR to ∼50 nm ($d \sim 30$); C+L band operation (95 nm) enables $d \sim 60$. Achieving $d=128$ requires FSR extension with C+L+S band operation (Supplementary Section S3).

VI.3 Chip Area Estimation

The $32 \times 256$ configuration (8192 MRRs) fits on a single ∼5 × 5 mm² die; the $64 \times 1024$ configuration requires multi-chip packaging or folded layouts. Detailed area estimates are provided in Supplementary Section S4. Figure 14 summarizes the scaling trend.

Figure 14: Prism photonic scaling projections. MRR count, aggregate heater power (SOI), and estimated chip area as functions of the configuration parameters $d$ and $N$. The dashed horizontal lines indicate practical limits: 200 W thermal dissipation (active cooling) and 858 mm² single-reticle area. Configurations below both limits (shaded region) are realizable on a single photonic chip.
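The scaling figures in Secs. VI.1 and VI.2 can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical function names; the conversion from 200 GHz to a wavelength spacing assumes a 1550 nm carrier, which is an assumption on my part rather than a value stated in the text (it gives the ∼1.6 nm spacing listed in table 1).

```python
C_LIGHT = 299_792_458.0  # speed of light (m/s)


def mrr_count(d, n_rows):
    """Eq. 15: total ring count of a d x N weight bank."""
    return d * n_rows


def static_power_w(d, n_rows, per_ring_w):
    """Aggregate static tuning power for a given per-ring power."""
    return mrr_count(d, n_rows) * per_ring_w


def channel_spacing_nm(spacing_ghz=200.0, wavelength_nm=1550.0):
    """Convert a frequency grid to wavelength spacing near wavelength_nm."""
    wavelength_m = wavelength_nm * 1e-9
    return wavelength_m ** 2 * spacing_ghz * 1e9 / C_LIGHT * 1e9


def channels_per_fsr(fsr_nm, spacing_nm):
    """Number of whole WDM channels that fit in one free spectral range."""
    return int(fsr_nm // spacing_nm)


if __name__ == "__main__":
    for d, n in ((32, 256), (64, 1024), (128, 4096)):
        soi = static_power_w(d, n, 2.5e-3)    # ~2.5 mW heater per ring
        tfln = static_power_w(d, n, 1e-6)     # < 1 uW leakage per ring
        print(f"d={d}, N={n}: {mrr_count(d, n)} MRRs, "
              f"SOI {soi:.1f} W, TFLN < {tfln:.3f} W")
    spacing = channel_spacing_nm()
    for fsr in (8.3, 50.0, 95.0):
        print(f"FSR {fsr} nm -> {channels_per_fsr(fsr, spacing)} channels")
```

The TFLN figure is an upper bound because eq. 16 uses the 1 µW leakage ceiling for every ring.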
VI.4 Time-Multiplexed Operation

Area and power constraints can be relaxed by trading physical parallelism for temporal reuse via time-multiplexed weight programming [60]. The system deploys $N_{phys}$ physical rows and cycles through $M$ weight configurations:

$$N_{logical} = M \times N_{phys}, \qquad M = \lceil N / N_{phys} \rceil. \tag{17}$$

On TFLN, EO reprogramming is sub-nanosecond ($t_{reprogram} \ll t_{optical}$), so the total latency simplifies to $t_{total} \approx M \times t_{optical}$. Even at $M=8$, the total latency (80 ns) remains four orders of magnitude below the GPU full-scan baseline (∼200 µs), a fundamental advantage over thermo-optic SOI ($t_{reprogram} \sim 10$ µs). For LLM decode at 128K+ context, $M=4$–8 is a practical sweet spot: it reduces physical MRR count by 4–8× (to 8192–16 384), keeps chip area within a single reticle, and resolves the area scaling barrier of Sec. VI.3, making $d=64$ realizable with current TFLN technology (fig. 15).

Figure 15: Time-multiplexed Prism operation. (a) Physical MRR count vs. multiplexing factor $M$ ($d=64$, $N=1024$). (b) Query evaluation latency vs. $M$, compared to electronic baselines. The shaded region indicates the sweet spot ($M=4$–8).

VI.5 Energy and Latency Crossover

We define the crossover point $n^*$ as the context length at which Prism-assisted decoding cost equals the electronic baseline. The Prism cost (photonic selection energy ∼931 pJ per query plus reduced GPU fetch) is compared against the GPU full-scan cost (fetching all $N$ blocks via HBM). On TFLN, near-zero static power means the selection cost is dominated by dynamic components. The full derivation is in Supplementary Section S2.

Figure 16: Energy crossover map for Prism vs. electronic baselines. The crossover contour ($C_{PRISM}/C_{baseline} = 1$) shifts to shorter context lengths as $d$ decreases. (a) vs. GPU full scan: practical benefit at $n^* \approx 4$K. (b) vs. GPU ANN: $n^* \approx 2$K. (c) vs. NVIDIA ICMS: $n^* \approx 4$K.

Energy crossover. Against the GPU full scan (fig. 16a), the mathematical crossover occurs at $n^* < 1$K tokens ($d=64$, $N_{bank}=4$); practical benefit emerges at $n \ge 4$K where traffic reduction exceeds 8×. The per-query dynamic energy (∼931 pJ; ∼9.9 nJ with amortized TEC) is five orders of magnitude below the H100 fetch energy at 128K context (∼48 µJ). This GPU baseline assumes a full-dimension scan ($d_h=128$). A fairer comparison lets the GPU scan compressed $d=32$ signatures, reducing scan energy to $E_{scan}^{fair} \approx 12$ µJ; even then, Prism maintains a four-order-of-magnitude advantage (∼931 pJ vs. ∼12 µJ), preserving a comfortable crossover margin. On thermo-optic SOI, the ∼164 W heater power would place the crossover at $n^* \approx 4$K. Against GPU ANN (FAISS IVF-PQ) [61] ($O(\sqrt{N})$ scan reduction), the crossover is at $n^* \approx 2$K. Against NVIDIA ICMS (DPU with lower bandwidth than GPU HBM), $n^* \approx 4$K based on estimated GTC 2024 specifications.

Latency crossover. The ∼9 ns photonic evaluation is orders of magnitude below the ∼5 µs GPU scan, so the selection step is effectively free in latency terms ($n^* \lesssim 4$K tokens).

Sensitivity. The dominant factors are: (i) dynamic power (∼88 mW, dominated by TIAs/ADCs/DACs, negligible vs. electronic baselines); (ii) signature dimension $d$ (controls MRR count and area, traded against recall quality, Sec. V.2); (iii) bank count $N_{bank}$ (splitting loss vs. parallelism); and (iv) HBM bandwidth (HBM4 improvements shift the crossover to longer contexts).

Scaling projections. The energy ratio $C_{PRISM}/C_{GPU}$ decreases as ∼$1/n$ because electronic scan cost grows linearly while Prism selection cost is fixed.
At $n = 1\,000\,000$ ($N \approx 8000$ blocks), the GPU reads ∼1 MB of signatures per head per query; Prism accommodates this with $N_{bank} = 8$ banks (512 000 MRRs total). In multi-agent scenarios, a single weight bank serves $A$ agents simultaneously (only the query sketch changes), amortizing dynamic power by $1/A$.

Figure 17: Energy advantage vs. context length. The ratio decreases as ∼$1/n$; at 1 000 000 tokens ($B=128$, $k=32$), the traffic reduction reaches $244\times$.

VII Discussion

VII.1 Limitations and Practical Considerations

All hardware results in this work are based on device-level simulations with parameters extracted from FDTD and supplemented by literature values; no physical prototype has been fabricated or measured. The impairment models, while grounded in FDTD simulation and published device data, may not capture all fabrication-dependent effects such as waveguide roughness variations, EO electrode non-uniformity across a large array, and packaging-induced stress. At $d=64$ and $N=1024$, the system requires 65 536 MRRs; systematic characterization of >10 000-MRR arrays on TFLN has not been reported, though recent progress in TFLN foundry processes suggests that large-scale integration is feasible [44].

Fabrication non-uniformity causes resonance wavelength variations of $\sigma_\lambda \sim 0.5$ nm to 2 nm across a wafer [10], but on the TFLN platform EO tuning can compensate via DC bias adjustment without static power penalty. Residual thermal drift, while mitigated by lithium niobate's ∼4× lower thermo-optic coefficient compared to silicon, still requires chip-level thermal stabilization (∼1 W TEC budget).
Including the TEC thermal stabilization overhead (amortized at $>1000$ queries/s throughput), the per-query energy rises from the ∼931 pJ dynamic-only figure to ∼9.9 nJ; the 931 pJ value cited elsewhere in this paper refers to the dynamic photonic pipeline alone.

Interface latency. The ∼9 ns latency reported for Prism reflects the photonic pipeline alone (DAC through top-$k$ selection) and does not include the host interface overhead. A PCIe 5.0 round-trip (DMA setup and transfer) adds ∼1 µs to 2 µs; CXL-attached memory semantics reduce this to ∼200 ns to 500 ns; direct interposer or co-packaged integration would add only ∼10 ns to 50 ns. Even with PCIe overhead, the total system latency of ∼2 µs remains below the GPU full-scan latency (∼5 µs), yielding a system-level $2$–$3\times$ speedup. Co-packaging, the long-term integration target, would preserve the $\sim 100\times$ raw photonic advantage. Speedup claims should therefore be interpreted as $2$–$3\times$ at the system level with PCIe, and potentially $100\times$ with co-packaging.

Demonstrated vs. projected scale. To clarify the maturity of the MRR integration scales assumed in this work: demonstrated TFLN arrays have reached ∼10–100 MRRs [44], while SOI platforms have demonstrated ∼$1000$–$10\,000$ MRRs [35]. Prism's "current" configuration ($8192$ MRRs at $d=32$, $N=256$) is a projected design point that extrapolates from these demonstrations; the flagship configuration ($65\,536$ MRRs at $d=64$, $N=1024$) is also projected and would likely require multi-chip or wafer-scale integration. At $d=64$ and $N=1024$, the system requires $65\,536$ individually addressable voltage bias lines for fabrication-offset compensation of each MRR, presenting a significant packaging and routing challenge that will require advanced fan-out or interposer-based solutions.
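The interface-latency argument above can be made concrete with a short calculation. The sketch below is only an illustration: it takes the ∼9 ns photonic latency, the ∼5 µs GPU scan, and the three interface overhead ranges quoted in the text, and assumes that the total latency is simply the sum of photonic and interface latency. The names `INTERFACE_NS` and `speedup_range` are hypothetical.

```python
PHOTONIC_NS = 9.0      # photonic pipeline latency quoted in the text
GPU_SCAN_NS = 5000.0   # GPU full-scan latency quoted in the text

# (min, max) host-interface overhead in nanoseconds, as quoted in the text.
INTERFACE_NS = {
    "pcie5": (1000.0, 2000.0),
    "cxl": (200.0, 500.0),
    "co_packaged": (10.0, 50.0),
}


def speedup_range(interface):
    """Return (worst_case, best_case) speedup over the GPU scan.

    Assumes total latency = photonic latency + interface overhead.
    """
    low_overhead, high_overhead = INTERFACE_NS[interface]
    worst = GPU_SCAN_NS / (PHOTONIC_NS + high_overhead)
    best = GPU_SCAN_NS / (PHOTONIC_NS + low_overhead)
    return worst, best


for name in INTERFACE_NS:
    worst, best = speedup_range(name)
    print(f"{name:12s} speedup between {worst:.1f}x and {best:.1f}x")
```

Under these assumptions the PCIe worst case lands in the $2$–$3\times$ range stated above, and the co-packaged case stays near or above the $\sim 100\times$ level.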
The add-drop MRR configuration with balanced photodetection resolves the sign limitation of through-port-only architectures. The balanced differential photocurrent $I_{\text{through}} - I_{\text{drop}}$ naturally encodes signed weights in $[-1,+1]$, enabling true signed inner products without ReLU projection or split encoding. The trade-off is a doubling of the photodetector count (two PDs per channel), but since PDs are orders of magnitude smaller than MRRs, the area penalty is negligible.

The retrieval head classification threshold $\tau=0.3$ used throughout this work becomes less discriminative at longer contexts, where most heads tend to exhibit high retrieval scores; by comparison, DuoAttention's learned gating identifies only 25–50% of heads as retrieval heads. The 90%+ fraction reported here should therefore be interpreted as an upper-bound estimate at the evaluated context lengths.

For multi-head serving, GQA [28] reduces the number of independent weight bank instances from the retrieval head count (102 for Qwen2.5-7B) to the KV head count ($H_{KV}=4$), since block signatures are derived from key vectors at KV-head granularity. These 4 heads can be served by time-multiplexed reprogramming (∼4 ns on TFLN, negligible vs. the ∼5 µs KV fetch) or by parallel replication of 4 weight banks. When the layer dimension is included, the full configuration space is $H_{KV} \times L = 4 \times 28 = 112$ weight bank instances per decode step. Under time-multiplexing, this amounts to $112 \times {\sim}1\,\text{ns} \approx 112\,\text{ns}$ of total reprogramming overhead, which is still $\sim 45\times$ smaller than a single KV block fetch (∼5 µs) and therefore negligible in the decode-step budget. Alternatively, a layer-parallel deployment with 28 Prism chips (one per layer, each serving 4 KV heads) would eliminate the layer serialization entirely at the cost of additional chip area.
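The reprogramming budget in the preceding paragraph is simple arithmetic. A minimal sketch follows, assuming the ∼1 ns EO settling time and ∼5 µs KV fetch quoted in the text; the function and parameter names are hypothetical.

```python
def reprogram_overhead(h_kv=4, layers=28, settle_ns=1.0, fetch_ns=5000.0):
    """Return (bank_instances, total_reprogram_ns, fetch_to_reprogram_ratio).

    Assumes one reprogramming event per (KV head, layer) pair per decode
    step, each costing `settle_ns`, executed strictly in sequence.
    """
    instances = h_kv * layers              # H_KV x L weight-bank configurations
    total_ns = instances * settle_ns       # serialized reprogramming time
    return instances, total_ns, fetch_ns / total_ns


instances, total_ns, ratio = reprogram_overhead()
print(f"{instances} instances, {total_ns:.0f} ns total, fetch is {ratio:.1f}x longer")
```

With the Qwen2.5-7B values ($H_{KV}=4$, $L=28$) this reproduces the 112 instances, the 112 ns total, and the roughly $45\times$ margin stated above.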
VII.2 Comparison with Related Approaches

The block-level top-$k$ selection mechanism at the core of Prism builds on a strategy independently validated by several works: Quest [18] preserves over 99% of full-attention accuracy on long-context benchmarks including NIAH up to 1M tokens, DuoAttention [16] maintains LongBench performance within 1–2% of full attention, and InfLLM [20] and RocketKV [19] provide additional evidence for block-level selection at long context. Prism's contribution is orthogonal: the key question is not whether block selection preserves quality (answered affirmatively above) but whether MRR-based analog computation introduces sufficient error to degrade the selection. Our NIAH results (Sec. V.3) confirm that it does not, even under pessimistic hardware impairments.

Tian et al.'s photonic transformer chip (PTC) [13] demonstrates that coherent optical interference can implement full transformer attention with high throughput ($>200$ POPS); however, it targets dense attention computation rather than the coarse block-selection task addressed by Prism, and its $O(n)$ memory access scaling remains for long-context KV caches.

InfLLM is the most directly comparable system, as it offloads the full KV cache to CPU RAM and retrieves blocks via electronic inner products. The key distinction is selection latency scaling: InfLLM's selection time grows as $O(N)$ with the number of cached blocks, while Prism's photonic engine evaluates all $N$ similarities in $O(1)$ optical transit time. This advantage grows with context length, which is precisely the regime where the KV cache bottleneck is most severe.

Relative to Quest [18] and RocketKV [19], which perform block selection digitally on the GPU, Prism targets a different bottleneck: these methods reduce compute by pruning low-scoring KV blocks but still require the GPU to read all block signatures from HBM (costing $O(N)$ memory traffic per decode step).
Prism eliminates this signature scan entirely by offloading it to a photonic co-processor with $O(1)$ latency and near-zero energy. The two are therefore complementary: Quest- or RocketKV-style scoring policies could be used to define which blocks are selected, while Prism accelerates the execution of that selection.

The GPU ANN baseline used in our crossover analysis (FAISS IVF-PQ [61]) represents a well-established but not state-of-the-art GPU search library; more recent GPU-accelerated ANN libraries (e.g., CAGRA, cuVS) may further reduce the electronic baseline latency and energy, narrowing the crossover window.

NVIDIA's ICMS [8] addresses the complementary capacity problem (terabyte-scale flash-backed KV storage with DPU-managed prefetch), while Prism solves the selection problem via photonic parallel inner products. Note that the ICMS energy and bandwidth specifications used in our comparisons are estimated from public announcements; no published measurements are available, and actual performance may differ. A natural integration would place Prism within or adjacent to the ICMS, combining storage capacity with photonic selection speed. The recently announced NVIDIA Rubin platform [62] further underscores industry momentum toward dedicated KV cache acceleration hardware, complementary to Prism's photonic approach.

VII.3 Outlook

The immediate next step is fabrication of a small-scale TFLN MRR prototype ($8 \times 8$ weight bank) to validate inner-product accuracy under real device impairments and provide measured values for parameters currently extracted from simulation. Scaling to a full module ($d=64$, $N=256$) integrated with GPU-based LLM inference would validate the crossover predictions of Sec. VI.5. Integrating non-volatile weight storage (e.g., phase-change trimming [63, 64]) could further reduce switching energy for quasi-static block signatures [65, 36].
More challenging benchmarks such as SCBench [66] and query-focused retrieval analysis [67] would strengthen confidence in the robustness of photonically selected blocks beyond the NIAH validation presented here.

Practical integration. A deployable Prism module would package the photonic chip, laser source, and TEC onto a single substrate, offered in one of three form factors: a PCIe add-in card for drop-in datacenter use, a CXL-attached device for lower-latency memory-semantic access, or a co-packaged chiplet on an interposer for maximum performance. Integration with existing LLM serving stacks (e.g., vLLM, TensorRT-LLM) would proceed via a block-index API: the host submits a query sketch and receives ranked block indices, transparently replacing the software signature-scan kernel.

Benchmark scope. NIAH is a retrieval-oriented benchmark that tests single-needle recall; it does not exercise multi-hop reasoning, summarization, or other long-context capabilities. Our results therefore validate retrieval fidelity but not general long-context quality. We note, however, that the block selection mechanism is inherited from Quest [18] and InfLLM [20], which have been validated on broader benchmarks (LongBench, ∞Bench); Prism's contribution is the photonic hardware mapping of this selection, not the selection algorithm itself.

VIII Conclusion

We have presented Prism, a TFLN photonic similarity engine that computes all $N$ block-selection inner products in $O(1)$ optical latency via the broadcast-and-weight paradigm. End-to-end NIAH evaluation confirms that MRR-selected block-sparse attention preserves full-attention accuracy from 4K to 64K tokens (within the model's native context window) under realistic hardware impairments (4–5 bit weights, 30 pm thermal drift), while reducing KV cache traffic by $16\times$ at 64K context ($k=32$, $B=128$; $32\times$ at 128K).
At longer contexts (128K+), model-intrinsic accuracy degrades independent of block selection; the photonic scaling analysis nevertheless projects favorable energy and latency scaling to million-token regimes as model context windows continue to expand. The practical energy benefit emerges at $n \geq 4$K, where block selection yields meaningful traffic reduction, making Prism favorable across virtually all practical context lengths. Future work will proceed along three axes: (i) fabrication and characterization of an $8 \times 8$ TFLN MRR weight bank to validate simulation predictions with measured device parameters; (ii) scaling to a full $d=64$, $N=256$ module integrated with GPU-based inference for end-to-end latency and energy measurements; and (iii) integration of non-volatile weight storage (e.g., phase-change trimming [63, 64]) for write-once signature programming, together with hardware-aware learned projections and broader benchmarks such as SCBench [66]. More broadly, photonic broadcast search may serve as a general paradigm for similarity-search workloads in data centers (including approximate nearest-neighbor retrieval, recommendation ranking, and embedding lookup), wherever a single query must be compared against a large, slowly changing set of stored vectors.

Disclosures

The authors declare no conflicts of interest.

Data Availability

Code and simulation data are available at https://github.com/hyoseokp/PRISM [68].

Supplementary Information

Appendix S1 Device Impairment Models

This section provides the full mathematical models for the six impairment sources that degrade the ideal inner-product computation of eq. 6.

Weight quantization. MRR transmission is programmed via electro-optic tuning with finite precision. We model the quantized weight as

$$ \hat{w}_{n,j} = \frac{\mathrm{round}(w_{n,j} \cdot 2^{b})}{2^{b}}, $$ (S1)

where $b$ is the effective bit precision. Values of $b=4$–$8$ are considered, corresponding to 16–256 distinguishable transmission levels.

Thermal drift.
After initial calibration, the MRR resonance wavelength drifts due to ambient temperature fluctuations. We model the drift as a Gaussian random walk:

$$ \Delta\lambda_0(t) = \sum_{i=1}^{t/\Delta t} \mathcal{N}(0, \sigma_{\text{drift}}^2), $$ (S2)

with $\sigma_{\text{drift}}$ chosen to produce a standard deviation of 0.01 nm to 0.1 nm over a calibration interval $T_{\text{cal}}$. The resulting weight error is

$$ \Delta w = \left| \frac{\partial T}{\partial \lambda} \right| \Delta\lambda_0 \approx \frac{8 Q^2 D_{\max}}{\lambda_0^2} \cdot \Delta\lambda_0, $$ (S3)

evaluated at the operating point on the MRR Lorentzian. Note that the i.i.d. Gaussian model above does not capture spatially correlated drift (e.g., center-to-edge temperature gradients across the chip), which could cause systematic bias in the inner-product scores rather than zero-mean random noise; such gradients would require a correlated noise model or per-region calibration.

Insertion loss. Each MRR introduces an off-resonance insertion loss $IL_{\text{MRR}} \approx 0.02$ dB to 0.05 dB, and the $1 \times N$ splitter contributes $IL_{\text{split}}$ from eq. 8. The total channel loss is

$$ IL_{\text{total}} = IL_{\text{split}} + d \cdot IL_{\text{MRR}} + IL_{\text{wg}}, $$ (S4)

where $IL_{\text{wg}}$ accounts for waveguide propagation loss (∼0.3 dB/cm for TFLN). High insertion loss reduces the SNR at the photodetector and increases the required laser power.

Photodetector noise. The photocurrent at each detector includes shot noise and thermal noise:

$$ \sigma_I^2 = 2 e I_{\text{ph}} \, \Delta f + \frac{4 k_B T}{R_L} \, \Delta f + \mathrm{NEP}^2 \cdot \Delta f, $$ (S5)

where $I_{\text{ph}}$ is the signal photocurrent, $\Delta f$ is the detection bandwidth, $R_L$ is the load resistance, and NEP is the noise-equivalent power of the photodetector (∼10 pW/$\sqrt{\text{Hz}}$ for Ge-on-Si) [69, 70]. The noise introduces a random perturbation to the inner-product score, potentially reordering the top-$k$ ranking.

MRR crosstalk. Adjacent MRRs on the same bus waveguide can exhibit spectral overlap if the channel spacing is insufficient relative to the MRR linewidth.
We model inter-channel crosstalk as an additive interference with isolation of −15 dB to −30 dB:

$$ y_n = \sum_{j=1}^{d} w_{n,j} \, s_j + \sum_{j=1}^{d} \sum_{m \neq j} \chi_{j,m} \, w_{n,m} \, s_m, $$ (S6)

where $\chi_{j,m}$ is the crosstalk coefficient from channel $m$ to channel $j$ [71].

Input DAC noise. The finite resolution and integral nonlinearity (INL) of the input DACs contribute an additional noise floor on the query sketch values. At $b_{\text{DAC}}=6$ bits, the quantization noise standard deviation is $\sigma_{\text{DAC}} = 2^{-b_{\text{DAC}}}/\sqrt{12} \approx 0.0045$.

Figure S1: Impact of weight quantization on recall. At $b=6$ bits, recall degrades by less than 5% from the floating-point ideal (Recall@8 = 0.960 at 6-bit). Adding thermal drift ($\sigma_{\text{th}}=0.01$) and detector noise ($\sigma_{\text{det}}=0.01$) degrades recall by an additional 5%.

Figure S2: Recall degradation as a function of thermal drift $\sigma$. Recall remains above 95% for $\sigma \leq 0.005$ (corresponding to ∼150 pm drift), achievable with standard thermal stabilization. At $\sigma=0.01$ (∼300 pm), recall is still 94.8%.

Figure S3: Weight encoding fidelity under MRR impairments ($d=32$, $N=128$). (a) Ideal vs. 5-bit quantised weights in $[0,1]$: the staircase pattern shows 32 discrete levels with RMSE $=0.009$. (b) Adding 20 pm thermal drift and fabrication variation broadens the scatter (RMSE $=0.014$). (c) Error histograms for three configurations: 5-bit only, 5-bit with 20 pm drift, and 4-bit with 30 pm drift. Even the pessimistic case concentrates errors within $\pm 5$% of the full weight range.

Figure S4: Recall@8 degradation as a function of photodetector noise $\sigma_{\text{det}}$ ($d=32$, $N=500$, 100 trials). Recall remains above the 95% threshold for $\sigma_{\text{det}} \leq 0.01$, achievable with standard Ge-on-Si photodetectors (NEP ∼1 pW/$\sqrt{\text{Hz}}$).
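To make the impairment models of this appendix concrete, the following standard-library Python sketch applies the weight quantization of eq. (S1), the crosstalk term of eq. (S6), and an additive Gaussian score perturbation standing in for detector noise, then compares the resulting top-$k$ set against the ideal one. It is an illustration under stated assumptions, not the authors' simulation code: the random signatures, the conversion of isolation in dB to a linear coefficient via $10^{-\text{dB}/10}$, the noise level of 0.01, and all function names are hypothetical choices.

```python
import math
import random


def quantize(w, bits):
    """Eq. (S1): snap a weight onto a grid with step 2**-bits."""
    scale = 2 ** bits
    return round(w * scale) / scale


def block_score(weights, query, chi):
    """Eq. (S6): ideal inner product plus inter-channel crosstalk.

    chi[j][m] is the (assumed linear) coupling from channel m into channel j.
    """
    d = len(query)
    signal = sum(weights[j] * query[j] for j in range(d))
    leak = sum(chi[j][m] * weights[m] * query[m]
               for j in range(d) for m in range(d) if m != j)
    return signal + leak


def top_k(scores, k):
    """Indices of the k highest-scoring blocks."""
    return set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])


def recall_at_k(ideal, impaired, k):
    """Fraction of the ideal top-k blocks recovered under impairments."""
    return len(top_k(ideal, k) & top_k(impaired, k)) / k


rng = random.Random(0)
d, n_blocks, k, bits = 32, 128, 8, 5
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_blocks)]
q = [rng.uniform(-1, 1) for _ in range(d)]

# Assumption: isolation drawn uniformly between 15 dB and 30 dB per channel pair.
no_crosstalk = [[0.0] * d for _ in range(d)]
chi = [[0.0 if j == m else 10 ** (-rng.uniform(15, 30) / 10) for m in range(d)]
       for j in range(d)]

W_q = [[quantize(w, bits) for w in row] for row in W]
rmse = math.sqrt(sum((a - b) ** 2 for ra, rb in zip(W, W_q)
                     for a, b in zip(ra, rb)) / (d * n_blocks))

ideal = [block_score(row, q, no_crosstalk) for row in W]
impaired = [block_score(row, q, chi) + rng.gauss(0.0, 0.01) for row in W_q]

sigma_dac = 2 ** -6 / math.sqrt(12)  # 6-bit DAC quantization noise
print(f"quantization RMSE = {rmse:.4f}")
print(f"recall@{k} = {recall_at_k(ideal, impaired, k):.3f}")
print(f"sigma_DAC = {sigma_dac:.4f}")
```

For uniformly distributed weights, the quantization RMSE of a 5-bit grid is analytically $2^{-5}/\sqrt{12} \approx 0.009$, which matches the value quoted in Figure S3(a); the recall printed by this sketch depends on the random draw and should not be compared directly with the paper's figures.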
Appendix S2 Crossover Derivation

This section provides the full algebraic derivation of the energy crossover point $n^*$ summarized in Sec. VI.5. The total decode cost per token for a retrieval head consists of two terms:

$$ C_{\text{total}} = C_{\text{select}}(n) + C_{\text{fetch}}(n,k), $$ (S7)

where $C_{\text{select}}$ is the cost of determining which blocks to fetch and $C_{\text{fetch}}$ is the cost of reading and computing attention over the selected blocks. For the GPU full scan baseline, no selection is needed ($k=N=n/B$), so

$$ C_{\text{GPU}} = C_{\text{fetch}}^{\text{GPU}}(n, n/B) = \frac{2 \, d_h \, n \, b_{\text{prec}}}{BW_{\text{HBM}}}, $$ (S8)

where $BW_{\text{HBM}}$ is the HBM bandwidth (∼3.35 TB/s for H100 HBM3). For Prism, the cost is

$$ C_{\text{PRISM}} = C_{\text{select}}^{\text{PRISM}}(n) + C_{\text{fetch}}^{\text{GPU}}(n, k \cdot B), $$ (S9)

where $C_{\text{select}}^{\text{PRISM}}$ is the dynamic energy of the photonic evaluation (laser, DAC, modulator, PD, ADC):

$$ C_{\text{select}}^{\text{PRISM}} \approx E_{\text{dynamic}}. $$ (S10)

Note that on the TFLN platform, the static MRR tuning power is near zero (capacitive EO), so the selection cost is dominated entirely by the dynamic components (∼931 pJ per query, table 4). The fetch cost is reduced by the selection ratio $k/N$:

$$ C_{\text{fetch}}^{\text{GPU}}(n, k \cdot B) = \frac{2 \, d_h \, k \, B \, b_{\text{prec}}}{BW_{\text{HBM}}}. $$ (S11)

Prism is favorable when $C_{\text{PRISM}} < C_{\text{GPU}}$, i.e., when the memory bandwidth saved by not fetching $(N-k)$ blocks exceeds the cost of operating the photonic selector. Setting $C_{\text{PRISM}} = C_{\text{GPU}}$ and solving for $n$ yields the crossover point:

$$ n^* = \frac{C_{\text{select}}^{\text{PRISM}} \cdot BW_{\text{HBM}}}{2 \, d_h \, b_{\text{prec}} \, (1 - k\,B/n^*)}, $$ (S12)

which must be solved self-consistently since $k$ and $N=n/B$ both depend on $n$. In practice, $k$ is a fixed parameter (e.g., $k=32$), so the selection ratio $k\,B/n \to 0$ as $n \to \infty$, and the crossover simplifies to

$$ n^* \approx \frac{C_{\text{select}}^{\text{PRISM}} \cdot BW_{\text{HBM}}}{2 \, d_h \, b_{\text{prec}}}. $$
(S13)

Appendix S3 WDM Channel Limits

The maximum number of WDM channels $d$ is constrained by the MRR free spectral range (FSR), the available optical bandwidth, and inter-channel crosstalk.

Single-FSR constraint. For a TFLN MRR with radius $R = 20$ µm, the FSR is approximately 8.3 nm (at $\lambda_0 = 1550$ nm, as in table 1). At a channel spacing of $\Delta\lambda_{\text{ch}} = 1.6$ nm (200 GHz on the ITU grid) [72], the maximum number of non-aliased channels within one FSR is

$$ d_{\max}^{(1\text{-FSR})} = \left\lfloor \frac{\text{FSR}}{\Delta\lambda_{\text{ch}}} \right\rfloor = \left\lfloor \frac{8.3}{1.6} \right\rfloor = 5. $$ (S14)

This is clearly insufficient for the $d=32$–$128$ range targeted by Prism.

FSR extension techniques. Vernier-coupled dual-ring filters or cascaded MRRs with slightly different radii can extend the effective FSR to ∼50 nm or more [46, 73], limited by the least common multiple of the individual FSRs. With an extended FSR of 50 nm and $\Delta\lambda_{\text{ch}} = 1.6$ nm:

$$ d_{\max}^{(\text{Vernier})} = \left\lfloor \frac{50}{1.6} \right\rfloor = 31 \approx 30. $$ (S15)

Band-limited operation. The usable optical bandwidth depends on the operating band:

• C-band (1530–1565 nm): 35 nm → practical limit $d \sim 20$–$30$ without FSR extension;
• C+L band (1530–1625 nm): 95 nm → $d \sim 60$.

For $d > 60$, extending to the S-band or using 0.8 nm (100 GHz) channel spacing is necessary, at the cost of tighter crosstalk margins. In practice, achieving $d=128$ requires both FSR extension and C+L+S band operation, representing a more aggressive photonic design point.

S3.1 Balanced Photodetection for Signed Weights

The add-drop MRR configuration provides both through-port transmission $T_{\text{through}}(\Delta\lambda)$ and complementary drop-port transmission $T_{\text{drop}}(\Delta\lambda) = 1 - T_{\text{through}}(\Delta\lambda)$ (for a lossless ring). A balanced photodetector pair measures the differential signal:

$$ w = T_{\text{through}} - T_{\text{drop}} = 2\,T_{\text{through}} - 1 \in [-1, +1]. $$
(S16)

This mapping naturally encodes signed weights without doubling the MRR count (as required by split encoding) or discarding sign information (as in ReLU projection). The noise model for balanced detection yields shot noise variance $\sigma^2 = 2 e \mathcal{R} P_0 \Delta f$, independent of the programmed weight, since power is conserved: $P_{\text{through}} + P_{\text{drop}} = P_0$ [69].

Appendix S4 Multi-Head Serving with GQA

A natural concern is whether serving all retrieval heads requires replicating the weight bank for each head. In Qwen2.5-7B, 102 out of 112 KV heads are retrieval heads at 8K context (table 6). However, GQA [28] reduces the number of independent KV heads to $H_{KV}=4$: each KV head is shared across $H/H_{KV}=7$ query heads. The weight bank stores block signatures derived from key vectors, so it operates at the KV-head granularity, not the query-head granularity. This means only 4 independent weight bank configurations are needed per layer, not 102. Two strategies can serve these 4 KV heads:

1. Time-multiplexed reuse. A single weight bank is reprogrammed sequentially for each of the 4 KV heads. On the TFLN platform, EO tuning settles in ∼1 ns (RC-limited), so reprogramming 4 heads adds only ∼4 ns, which is negligible compared to the subsequent KV block fetch (∼5 µs).
2. Parallel replication. Four weight banks are deployed in parallel, one per KV head. This requires $4 \times d \times N = 4 \times 64 \times 1024 = 262\,144$ MRRs total, a $4\times$ increase over the single-head case but within the scalability limits discussed in Sec. VII.1.

GQA thus reduces the multi-head serving problem from 102 independent banks (one per retrieval head) to just 4, making both time-multiplexed and spatially-parallel approaches practical. For Qwen3-8B ($H_{KV}=8$), the same argument applies with 8 KV heads, still far below the 258 retrieval heads.

Appendix S5 Additional Figures

Figure S5: Signed vs. unsigned recall comparison.
Balanced photodetection (signed $[-1,+1]$) consistently outperforms ReLU projection (unsigned $[0,1]$) and split encoding across all signature dimensions $d$.

Figure S6: KV cache traffic reduction factor $N/k$ as a function of context length for different $k$ values.

References

Dao [2024] T. Dao, FlashAttention-2: Faster attention with better parallelism and work partitioning, in International Conference on Learning Representations (2024) arXiv:2307.08691 (2023). Gholami et al. [2024] A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, AI and memory wall, IEEE Micro 44, 33 (2024). OpenAI et al. [2023] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023). Gemini Team Google [2024] Gemini Team Google, Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, arXiv preprint arXiv:2403.05530 (2024). Meta AI [2024] Meta AI, The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024). Qwen Team [2025a] Qwen Team, Qwen2.5-1M Technical Report, arXiv preprint arXiv:2501.15383 (2025a). NVIDIA [2025a] NVIDIA, NVIDIA Vera Rubin Platform, https://nvidianews.nvidia.com/news/nvidia-vera-rubin (2025a), announced at CES 2025; next-generation GPU architecture with integrated memory connectivity. NVIDIA [2025b] NVIDIA, Inference Context Memory Storage (ICMS): BlueField-4 DPU for LLM Inference, https://nvidianews.nvidia.com/news/nvidia-vera-rubin (2025b), flash-backed KV cache with hardware-assisted eviction and prefetch. Tait et al. [2014] A. N. Tait, M. A. Nahmias, B. J. Shastri, T. F. de Lima, and P. R. Prucnal, Broadcast and weight: An integrated network for scalable photonic spike processing, Journal of Lightwave Technology 32, 3427 (2014). Tait et al. [2016] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, and P. R. Prucnal, Microring weight banks, IEEE Journal of Selected Topics in Quantum Electronics 22, 312 (2016).
Hua et al. [2025] S. Hua, E. Divita, S. Yu, B. Peng, C. Roques-Carmes, Z. Su, Z. Chen, Y. Bai, J. Zou, Y. Zhu, Y. Xu, C. Lu, Y. Di, H. Chen, L. Jiang, L. Wang, L. Ou, C. Zhang, J. Chen, W. Zhang, H. Zhu, W. Kuang, L. Wang, H. Meng, M. Steinman, and Y. Shen, An integrated large-scale photonic accelerator with ultralow latency, Nature 640, 361 (2025). Zhu et al. [2024a] H. Zhu, J. Gu, H. Wang, Z. Jiang, Z. Zhang, R. Tang, C. Feng, S. Han, R. T. Chen, and D. Z. Pan, Lightening-Transformer: A dynamically-operated optically-interconnected photonic transformer accelerator, in Proc. IEEE International Symposium on High-Performance Computer Architecture (HPCA) (2024) p. 686–703. Tian et al. [2025] Y. Tian, S. Xiang, X. Guo, Y. Zhang, J. Xu, S. Shi, H. Zhao, Y. Wang, X. Niu, W. Liu, and Y. Hao, Photonic transformer chip: interference is all you need, PhotoniX 6, 45 (2025). Fu et al. [2024] T. Fu, J. Zhang, R. Sun, Y. Huang, W. Xu, S. Yang, Z. Zhu, and H. Chen, Optical neural networks: progress and challenges, Light: Science & Applications 13, 10.1038/s41377-024-01590-3 (2024). Wu et al. [2025] W. Wu, Y. Wang, G. Xiao, H. Peng, and Y. Fu, Retrieval head mechanistically explains long-context factuality, in International Conference on Learning Representations (ICLR) (2025) oral presentation. Xiao et al. [2025] G. Xiao, J. Tang, J. Zuo, J. Guo, S. Yang, H. Tang, Y. Fu, and S. Han, DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads, in International Conference on Learning Representations (ICLR) (2025). Tang et al. [2024a] H. Tang, Y. Lin, J. Lin, Q. Han, S. Hong, Y. Yao, and G. Wang, RazorAttention: Efficient KV cache compression through retrieval heads, arXiv preprint arXiv:2407.15891 (2024a). Tang et al. [2024b] J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han, Quest: Query-aware sparsity for efficient long-context LLM inference, in International Conference on Machine Learning (ICML) (2024). Behnam et al. [2025] P. Behnam, Y. Fu, R. 
Zhao, P.-A. Tsai, Z. Yu, and A. Tumanov, RocketKV: Accelerating long-context LLM inference via two-stage KV cache compression, in International Conference on Machine Learning (ICML) (2025). Xiao et al. [2024a] C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, S. Han, and M. Sun, InfLLM: Training-free long-context extrapolation for LLMs with an efficient context memory, arXiv preprint arXiv:2402.04617 (2024a). Liu et al. [2025a] X. Liu, Z. Tang, P. Dong, Z. Li, B. Li, X. Hu, and X. Chu, ChunkKV: Semantic-preserving KV cache compression for efficient long-context LLM inference, in Advances in Neural Information Processing Systems (NeurIPS) (2025). Li et al. [2025a] H. Li, Y. Li, A. Tian, T. Tang, Z. Xu, X. Chen, N. Hu, W. Dong, Q. Li, and L. Chen, A survey on large language model acceleration based on KV cache management, Transactions on Machine Learning Research (2025a). Zhang et al. [2023] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen, H2O: Heavy-hitter oracle for efficient generative inference of large language models, in Advances in Neural Information Processing Systems (NeurIPS) (2023). Xiao et al. [2024b] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, Efficient streaming language models with attention sinks, in International Conference on Learning Representations (ICLR) (2024). Hooper et al. [2024] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami, KVQuant: Towards 10 million context length LLM inference with KV cache quantization, in Advances in Neural Information Processing Systems (NeurIPS) (2024). Liu et al. [2025b] G. Liu, C. Li, Z. Ning, J. Lin, Y. Yao, D. Ke, M. Guo, and J. Zhao, FreeKV: Boosting KV cache retrieval for efficient LLM inference, arXiv preprint arXiv:2505.13109 (2025b). Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. 
Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems, Vol. 30 (2017) p. 5998–6008. Ainslie et al. [2023] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai, GQA: Training generalized multi-query transformer models from multi-head checkpoints, in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2023) p. 4895–4901. Shazeer [2019] N. Shazeer, Fast transformer decoding: One write-head is all you need, arXiv preprint arXiv:1911.02150 (2019). Reck et al. [1994] M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, Experimental realization of any discrete unitary operator, Physical Review Letters 73, 58 (1994). Clements et al. [2016] W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walmsley, Optimal design for universal multiport interferometers, Optica 3, 1460 (2016). Shen et al. [2017] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, Deep learning with coherent nanophotonic circuits, Nature Photonics 11, 441 (2017). Xu et al. [2005] Q. Xu, B. Schmidt, S. Pradhan, and M. Lipson, Micrometre-scale silicon electro-optic modulator, Nature 435, 325 (2005). Zhou et al. [2022] H. Zhou, J. Dong, J. Cheng, W. Dong, C. Huang, Y. Shen, Q. Zhang, M. Gu, C. Qian, H. Chen, Z. Ruan, and X. Zhang, Photonic matrix multiplication lights up photonic accelerator and beyond, Light: Science & Applications 11, 30 (2022). Huang et al. [2020] C. Huang, S. Bilodeau, T. F. de Lima, A. N. Tait, P. Y. Ma, E. C. Blow, A. Jha, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, Demonstration of scalable microring weight bank control for large-scale photonic integrated circuits, APL Photonics 5, 040803 (2020). Zhang et al. [2025a] H. Zhang, Y. Song, S. Chen, Y. Bai, X. Xu, C. Huang, J. Wang, H. Chen, D. J. Moss, and K. 
Xu, Integrated platforms and techniques for photonic neural networks, npj Nanophotonics 2, 40 (2025a).
Johnson and Lindenstrauss [1984] W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Contemporary Mathematics 26, 189 (1984).
Wang et al. [2018] C. Wang, M. Zhang, X. Chen, M. Bertrand, A. Shams-Ansari, S. Chandrasekhar, P. Winzer, and M. Lončar, Integrated lithium niobate electro-optic modulators operating at CMOS-compatible voltages, Nature 562, 101 (2018).
Shastri et al. [2021] B. J. Shastri, A. N. Tait, T. Ferreira de Lima, W. H. P. Pernice, H. Bhaskaran, C. D. Wright, and P. R. Prucnal, Photonics for artificial intelligence and neuromorphic computing, Nature Photonics 15, 102 (2021).
Totovic et al. [2022] A. Totovic, G. Giamougiannis, A. Tsakyridis, D. Lazovsky, and N. Pleros, Programmable photonic neural networks combining WDM with coherent linear optics, Scientific Reports 12, 5605 (2022).
Lischke et al. [2021] S. Lischke, A. Peczek, J. S. Morgan, K. Sun, D. Steckler, Y. Yamamoto, F. Korndörfer, C. Mai, S. Marschmeyer, M. Fraschke, A. Krüger, A. Beling, and L. Zimmermann, Ultra-fast germanium photodiode with 3-dB bandwidth of 265 GHz, Nature Photonics 15, 925 (2021).
Peserico et al. [2023] N. Peserico, B. J. Shastri, and V. J. Sorger, Integrated photonic tensor processing unit for a matrix multiply: A review, Journal of Lightwave Technology 41, 3704 (2023).
Feldmann et al. [2021] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, Parallel convolutional processing using an integrated photonic tensor core, Nature 589, 52 (2021).
Hu et al. [2025] Y. Hu, Y. Song, X. Zhu, X. Guo, S. Lu, Q. Zhang, L. He, C. A. A. Franken, K. Powell, H. Warner, et al., Integrated lithium niobate photonic computing circuit based on efficient and high-speed electro-optic conversion, Nature Communications 16, 8178 (2025).
Zhu et al. [2024b] X. Zhu, Y. Hu, S. Lu, H. K. Warner, X. Li, Y. Song, L. S. Magalhães, A. Shams-Ansari, N. Sinclair, and M. Lončar, Twenty-nine million intrinsic Q-factor monolithic microresonators on thin-film lithium niobate, Photonics Research 12, A63 (2024b).
Tait et al. [2017] A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, Neuromorphic photonic networks using silicon photonic weight banks, Scientific Reports 7, 7430 (2017).
Bogaerts et al. [2012] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets, Silicon microring resonators, Laser & Photonics Reviews 6, 47 (2012).
Ferreira de Lima et al. [2022] T. Ferreira de Lima, E. A. Doris, S. Bilodeau, W. Zhang, A. Jha, H.-T. Peng, E. C. Blow, C. Huang, A. N. Tait, B. J. Shastri, and P. R. Prucnal, Design automation of photonic resonator weights, Nanophotonics 11, 3805 (2022).
Padmaraju and Bergman [2014] K. Padmaraju and K. Bergman, Resolving the thermal challenges for silicon microring resonator devices, Nanophotonics 3, 269 (2014).
Sunny et al. [2021] F. Sunny, A. Mirza, M. Nikdast, and S. Pasricha, CrossLight: A cross-layer optimized silicon photonic neural network accelerator, in Proc. 58th ACM/IEEE Design Automation Conference (DAC) (2021) p. 1069–1074.
Choquette [2023] J. Choquette, NVIDIA Hopper H100 GPU: Scaling performance, IEEE Micro 43, 9 (2023).
Qwen Team [2024] Qwen Team, Qwen2.5 technical report, arXiv preprint arXiv:2412.15115 (2024).
Qwen Team [2025b] Qwen Team, Qwen3 technical report, arXiv preprint arXiv:2505.09388 (2025b).
Indyk and Motwani [1998] P. Indyk and R. Motwani, Approximate nearest neighbors: Towards removing the curse of dimensionality, in Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC) (1998) p. 604–613.
Manning et al. [2008] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval (Cambridge University Press, 2008).
Kamradt [2023] G. Kamradt, Needle in a haystack — pressure testing LLMs, https://github.com/gkamradt/LLMTest_NeedleInAHaystack (2023).
Shekhar et al. [2024] S. Shekhar, W. Bogaerts, L. Chrostowski, J. E. Bowers, M. Hochberg, R. Soref, and B. J. Shastri, Roadmapping the next generation of silicon photonics, Nature Communications 15, 751 (2024).
Seok et al. [2019] T. J. Seok, K. Kwon, J. Henriksson, J. Luo, and M. C. Wu, Wafer-scale silicon photonic switches beyond die size limit, Optica 6, 490 (2019).
Dong [2016] P. Dong, Silicon photonic integrated circuits for wavelength-division multiplexing applications, IEEE Journal of Selected Topics in Quantum Electronics 22, 370 (2016).
Bai et al. [2023] Y. Bai, X. Xu, M. Tan, Y. Sun, Y. Li, J. Wu, R. Morandotti, A. Mitchell, K. Xu, and D. J. Moss, Photonic multiplexing techniques for neuromorphic computing, Nanophotonics 12, 795 (2023).
Johnson et al. [2021] J. Johnson, M. Douze, and H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7, 535 (2021).
Aubrey [2026] K. Aubrey, Inside the NVIDIA Rubin platform: Six new chips, one AI supercomputer, NVIDIA Developer Blog (2026).
Tossoun et al. [2024] B. Tossoun, D. Liang, S. Cheung, Z. Fang, X. Sheng, J. P. Strachan, and R. G. Beausoleil, High-speed and energy-efficient non-volatile silicon photonic memory based on heterogeneously integrated memresonator, Nature Communications 15, 551 (2024).
Adya et al. [2025] U. Adya, S. Singhal, R. Chen, I.-T. Chen, S. Joshi, A. Majumdar, M. Li, and S. Moazeni, Non-volatile tuning of cryogenic silicon photonic micro-ring modulators, Nature Communications 16, 9290 (2025).
Fayza et al. [2025] F. Fayza, C. Demirkiran, S. P. Rao, D. Bunandar, U. Gupta, and A. Joshi, Photonics for sustainable AI, Communications Physics 8, 10.1038/s42005-025-02300-0 (2025).
Li et al. [2025b] Y. Li, H. Jiang, Q. Wu, X. Luo, S. Ahn, C. Zhang, A. H. Abdi, D. Li, J. Gao, Y. Yang, and L. Qiu, SCBench: A KV cache-centric analysis of long-context methods, in International Conference on Learning Representations (ICLR) (2025).
Zhang et al. [2025b] W. Zhang, F. Yin, H. Yen, D. Chen, and X. Ye, Query-focused retrieval heads improve long-context reasoning and re-ranking, in Conference on Empirical Methods in Natural Language Processing (EMNLP) (2025) p. 23791–23805.
Park and Park [2025] H. Park and Y. Park, PRISM: Photonic retrieval-index similarity module — experiment code, https://github.com/hyoseokp/PRISM (2025), source code for retrieval-head identification, signature generation, recall measurement, hardware-aware simulation, and downstream evaluation.
Saleh and Teich [2019] B. E. A. Saleh and M. C. Teich, Fundamentals of Photonics, 3rd ed. (John Wiley & Sons, 2019).
Personick [1973] S. D. Personick, Receiver design for digital fiber optic communication systems, I, Bell System Technical Journal 52, 843 (1973).
Jayatilleka et al. [2016] H. Jayatilleka, K. Murray, M. Caverley, N. A. F. Jaeger, L. Chrostowski, and S. Shekhar, Crosstalk in SOI microring resonator-based filters, Journal of Lightwave Technology 34, 2886 (2016).
International Telecommunication Union [2020] International Telecommunication Union, Spectral grids for WDM applications: DWDM frequency grid, ITU-T Recommendation G.694.1 (2020).
Boeck et al. [2010] R. Boeck, N. A. F. Jaeger, N. Rouger, and L. Chrostowski, Series-coupled silicon racetrack resonators and the Vernier effect: theory and measurement, Optics Express 18, 25151 (2010).