Paper deep dive
Toward a Theory of Hierarchical Memory for Language Agents
Yashar Talebirad, Ali Parsaee, Csongor Y. Szepesvari, Amirhossein Nadiri, Osmar Zaiane
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/26/2026, 2:30:15 AM
Summary
The paper proposes a unifying theoretical framework for hierarchical memory in language agents, decomposing existing systems into a three-operator pipeline: Extraction (α), Coarsening (C), and Traversal (τ). It introduces the 'self-sufficiency' spectrum for representative functions and the 'coarsening-traversal (C–T) coupling' constraint, which dictates how memory systems should be designed based on the information-preserving quality of their representatives.
Entities (7)
Relation Signals (3)
Hierarchical Memory → decomposedinto → Extraction (α), Coarsening (C), Traversal (τ)
confidence 95% · We propose a unifying theory in terms of three operators: Extraction (α), coarsening (C), and traversal (τ).
RAPTOR → implements → Hierarchical Memory
confidence 95% · Hierarchical memory is a recurring response: RAPTOR (Sarthi et al., 2024)...
Self-sufficiency → constrains → Traversal (τ)
confidence 90% · We identify a self-sufficiency spectrum for the representative function ρ and show how it constrains viable retrieval strategies
Cypher Suggestions (2)
Find all systems that implement a specific operator · confidence 85% · unvalidated
MATCH (s:System)-[:IMPLEMENTS]->(o:Operator {name: 'Traversal'}) RETURN s.nameMap the relationship between systems and their memory type · confidence 80% · unvalidated
MATCH (s:System)-[r:USES_MEMORY_TYPE]->(t:MemoryType) RETURN s.name, t.name
Abstract
Abstract:Many recent long-context and agentic systems address context-length limitations by adding hierarchical memory: they extract atomic units from raw data, build multi-level representatives by grouping and compression, and traverse this structure to retrieve content under a token budget. Despite recurring implementations, there is no shared formalism for comparing design choices. We propose a unifying theory in terms of three operators. Extraction ($\alpha$) maps raw data to atomic information units; coarsening ($C = (\pi, \rho)$) partitions units and assigns a representative to each group; and traversal ($\tau$) selects which units to include in context given a query and budget. We identify a self-sufficiency spectrum for the representative function $\rho$ and show how it constrains viable retrieval strategies (a coarsening-traversal coupling). Finally, we instantiate the decomposition on eleven existing systems spanning document hierarchies, conversational memory, and agent execution traces, showcasing its generality.
Tags
Links
- Source: https://arxiv.org/abs/2603.21564v1
- Canonical: https://arxiv.org/abs/2603.21564v1
Full Text
36,258 characters extracted from source content.
Expand or collapse full text
Toward a Theory of Hierarchical Memory for Language Agents Yashar Talebirad University of Alberta talebira@ualberta.ca &Ali Parsaee University of Alberta Parsaee@ualberta.ca &Csongor Y. Szepesvari University of Alberta csongor@ualberta.ca &Amirhossein Nadiri York University anadiri@yorku.ca &Osmar Zaiane University of Alberta zaiane@ualberta.ca Abstract Many recent long-context and agentic systems address context-length limitations by adding hierarchical memory: they extract atomic units from raw data, build multi-level representatives by grouping and compression, and traverse this structure to retrieve content under a token budget. Despite recurring implementations, there is no shared formalism for comparing design choices. We propose a unifying theory in terms of three operators. Extraction (α) maps raw data to atomic information units; coarsening (C=(π,ρ)C=(π,ρ)) partitions units and assigns a representative to each group; and traversal (τ) selects which units to include in context given a query and budget. We identify a self-sufficiency spectrum for the representative function ρ and show how it constrains viable retrieval strategies (a coarsening-traversal coupling). Finally, we instantiate the decomposition on eleven existing systems spanning document hierarchies, conversational memory, and agent execution traces, showcasing its generality. 1 Introduction Language agents increasingly rely on large external corpora and long interaction histories, but larger context windows do not reliably improve information use. Models underweight mid-context evidence (Liu et al., 2024), struggle with single-fact retrieval in long documents (Hsieh et al., 2024), and often degrade as context grows (“context dilution”/“context rot”). This reveals a gap between capacity (more tokens) and control (what to surface, and when). Hierarchical memory is a recurring response: RAPTOR (Sarthi et al., 2024), GraphRAG (Edge et al., 2024), xMemory (Hu et al., 2026), H-MEM (Sun and Zeng, 2025), and SimpleMem (Liu et al., 2026) build multi-level representations by grouping, compressing, and selectively expanding under a budget; analogous ideas appear in agent execution-trace memory, including reasoning-focused traces (MemoBrain (Qian et al., 2026), StackPlanner (Zhang et al., 2026)) and action/state traces (InfiAgent (Yu et al., 2026a)). Yet these systems are usually presented as isolated design points, making it hard to compare assumptions. We show that, every hierarchical memory system, whether it operates on stored knowledge or live agent execution traces, instantiates a single three-operator pipeline (Figure 1). Extraction (α) maps raw data to a graph of atomic information units. Coarsening (C=(π,ρ)C=(π,ρ)) partitions units into groups and produces a representative for each group, yielding a smaller graph, and iterating this produces the hierarchy. Traversal (τ) takes the hierarchy, a query, and token budget, and returns a subset of atomic units to include in context. The central theoretical observation is that the representative function ρ varies along a self-sufficiency spectrum: a detailed summary preserves most of the group’s information, while a category label preserves almost none. This property constrains which retrieval strategy works. Self-sufficient representatives support collapsed search over all levels (RAPTOR), while referential representatives require top-down refinement (H-MEM, xMemory). We call this constraint the coarsening-traversal (C–T) coupling. Prior efforts classify memory systems but do not formalize hierarchy construction. CoALA (Sumers et al., 2024) organizes memory by cognitive type (working, episodic, semantic, procedural), and MemEngine (Zhang et al., 2025b) implements multiple systems from shared primitives. We provide a mathematical unification, extended to agent execution traces: a formal (α,C,τ)(α,C,τ) decomposition with information monotonicity, a self-sufficiency spectrum with C–T coupling, and an instantiation of this decomposition across eleven data and trace systems. Figure 1: The (α,C,τ)(α,C,τ) pipeline: D is extracted (α) into atoms G0G_0; coarsening C1,C2,C3C_1,C_2,C_3 yields layers G1,G2,G3G_1,G_2,G_3; traversal τ takes a query and budget and returns a subset S of atoms. 2 The Framework 2.1 Core Definitions Natural language exhibits multiscale structure (themes, topics, individual facts), and queries may target any scale. We define three operators that build multiresolution representations of this structure. Information units. Let ℱ=ℱ1×⋯×ℱpF=F_1×·s×F_p be a product of feature spaces with one distinguished factor ℱc=Σ∗F_c= ^* (text). A unit u∈ℱu has projections ϕi(u) _i(u); ϕc(u) _c(u) is the content. Typical attributes include embeddings ϕemb _emb, timestamps ϕtime _time, entities ϕent _ent, and structural paths ϕpath _path. Unit graph. G=(U,E)G=(U,E) with U a finite set of units and E⊆U×UE U× U. The edge set may be empty (a flat bag of units, as in SimpleMem), may encode entity-entity relationships (a knowledge graph, as in GraphRAG), or may encode sequential adjacency (a conversation stream, as in xMemory). Extraction. The first operator maps raw data into structured units. α:→α:D maps raw data D to a unit graph G0=(U0,E0)G_0=(U_0,E_0). All preprocessing (chunking, named entity recognition, path assignment) happens inside α, and thus, each unit is self-contained in the resulting graph G0G_0. Coarsening. A coarsening operator on G=(U,E)G=(U,E) is C=(π,ρ)C=(π,ρ): (i) π:U↠[m]π:U [m] is a surjective grouping that induces the partition Gj=π−1(j)\G_j=π^-1(j)\; (i) ρ:2U→ℱρ:2^U is a representative function such that ρ(Gj)∈ℱρ(G_j) . We assume |U′|=m<|U||U |=m<|U|, where U′=ρ(Gj):j∈[m]U =\ρ(G_j):j∈[m]\.111Some systems (RAPTOR’s GMM soft clustering, overlapping community detection) allow a unit to belong to multiple groups, replacing π:U→[m]π:U→[m] with π:U→2[m]∖∅π:U→ 2^[m] ; the hierarchy then becomes a DAG. Hierarchy. Iterating CiC_i produces a hierarchy ℋ=(G0,C1,…,CL)H=(G_0,C_1,…,C_L) with Gℓ=Cℓ(Gℓ−1)G_ =C_ (G_ -1).222ℋH lists the initial graph G0G_0 and the coarsening operators; the graphs G1,…,GLG_1,…,G_L are the results of applying them, not separate entries in the tuple. The “layers” in the multi-level view are these derived graphs. Denote the node set of GℓG_ as VℓV_ . Level 0 is thus the atom set V0V_0; each application of CℓC_ builds the next layer VℓV_ from Vℓ−1V_ -1 by grouping nodes (via πℓ _ ) and assigning one representative per group (via ρℓ _ ), with edges from each representative to its children. The hierarchy can thus be viewed as a multi-layer graph: the bottom layer holds fine-grained units, and each higher layer is a coarsened graph whose nodes summarize the subgraph below (Figure 1). Each CℓC_ may use a different grouping and representative strategy (e.g., xMemory uses temporal grouping at level 1, semantic grouping at level 2, and thematic grouping at level 3). 2.2 Multi-Resolution Representation Assume the input data D is drawn from a distribution over D. Since V0=α(D)V_0=α(D) and each Vℓ=Cℓ(Vℓ−1)V_ =C_ (V_ -1), every level inherits randomness from D, and the entropies below are well defined. Let ℋ=(G0,C1,…,CL)H=(G_0,C_1,…,C_L), and assume each CℓC_ is deterministic and information-losing with positive probability under D (equivalently, not almost surely invertible). Then: (i) H(V0)>H(V1)>⋯>H(VL)H(V_0)>H(V_1)>·s>H(V_L), since determinism gives H(Vℓ)≤H(Vℓ−1)H(V_ )≤ H(V_ -1) and strict information loss on a nonzero-probability set makes the inequality strict; (i) for any query variable Q with query-agnostic coarsening (i.e., VℓV_ depends only on Vℓ−1V_ -1 and not on Q), I(Q;V0)≥I(Q;V1)≥⋯≥I(Q;VL)I(Q;V_0)≥ I(Q;V_1)≥·s≥ I(Q;V_L), since learning VℓV_ provides no additional information about Q once Vℓ−1V_ -1 is known; equivalently, Q–Vℓ−1V_ -1–VℓV_ forms a Markov chain (and the Data Processing Inequality applies). This formalizes that every coarsening step loses information.333This also aligns with the Information Bottleneck principle (Tishby et al., 1999): as compression becomes more aggressive at each level of ℋH, representations move along a trade-off curve between retained detail and compactness, suggesting a multi-resolution structure. The atoms V0V_0 set the information ceiling; finer extraction increases H(V0)H(V_0) and raises the ceiling on downstream retrieval quality.444Appendix B relates data size, context window, and compression ratio to the number of levels needed. 2.3 Self-Sufficiency and the C–T Coupling The representative function ρ is where systems diverge most, and its properties determine the behavior of the entire hierarchy. The self-sufficiency of ρ at group GjG_j measures how much of the group’s information the representative preserves: 555A computable LLM-based SθS_θ and a query-dependent refinement SSQS_Q are developed in Appendix A. S(ρ,Gj)=I(Gj;ρ(Gj))H(Gj)=1−H(Gj∣ρ(Gj))H(Gj).S(ρ,G_j)= I(G_j;ρ(G_j))H(G_j)=1- H(G_j ρ(G_j))H(G_j). (1) Assume H(Gj)>0H(G_j)>0; if H(Gj)=0H(G_j)=0, define S=1S=1. When self-sufficiency is high (close to 1), the representative preserves most of the group’s content and can answer queries on its own (e.g., LLM-generated summaries). When self-sufficiency is low (close to 0), the representative serves only as a routing label, indicating what the group contains without reproducing it (e.g., domain names). For any query Q, I(Q;Gj)−I(Q;ρ(Gj))≤H(Gj∣ρ(Gj))I(Q;G_j)-I(Q;ρ(G_j))≤ H(G_j ρ(G_j)), so high self-sufficiency bounds the information loss incurred by reading the representative rather than the children. C–T coupling. Self-sufficiency constrains which traversal strategies are viable, reflecting a rate-distortion tradeoff. Self-sufficient ρ operates at low distortion: coarse nodes carry enough information to answer queries directly, so collapsed search suffices (single-shot decoding; RAPTOR). Referential ρ operates at high distortion: coarse nodes serve only as routing labels, so top-down refinement is needed to recover the lost detail (successive refinement; H-MEM, xMemory). Mismatching ρ and traversal wastes budget: collapsed search over referential representatives spends tokens on uninformative labels, while top-down routing through self-sufficient representatives spends tokens on unnecessary expansion. Existing evidence is consistent with this observation, but a controlled comparison across the spectrum remains open. 2.4 Traversal A traversal τ:(ℋ,q,B)→S⊆V0τ:(H,q,B)→ S V_0 takes a hierarchy, query, and token budget and returns atoms satisfying ∑u∈S|ϕc(u)|≤B _u∈ S| _c(u)|≤ B. By pruning via hierarchy structure, traversal can reduce relevance evaluations from O(n)O(n) (flat search) to O(logn)O( n) under bounded compute. Four patterns recur: top-down refinement (H-MEM, xMemory), collapsed search across levels (RAPTOR), multi-view parallel retrieval (SimpleMem), and reasoning-based navigation (PageIndex (Zhang et al., 2025a), MemWalker (Chen et al., 2023)).666Pseudocode appears in Appendix D. This reflects a dual role of ℋH: with referential ρ, it mainly serves as an index for top-down routing; with self-sufficient ρ, it also serves as a representation whose non-leaf nodes often answer broad queries directly (collapsed search). 3 Instantiation 3.1 Data and Trace Systems To showcase generality, Table 1 maps eleven systems to (α,C,τ)(α,C,τ). Data memory systems operate on stored content: batch document hierarchies (RAPTOR, GraphRAG), online conversational memory (xMemory, H-MEM, SimpleMem), observational compression (Mastra’s Observational Memory (Barnes, 2026)), and structured navigation (PageIndex). Agent execution-trace systems operate on execution: α segments traces, C groups/compresses them (e.g., MemoBrain’s Fold/Flush, StackPlanner’s Condensation/Pruning), and τ reconstructs working context. In trace memory, coherence is causal-functional (steps for the same subproblem) rather than semantic, and the training signal is task reward (DPO, GRPO) rather than retrieval metrics. AgeMem (Yu et al., 2026b) exposes store/retrieve/summarize/filter/discard as tool actions, while InfiAgent uses bounded reconstruction from workspace state ℱtF_t and recent actions (Yu et al., 2026a).777Bounded context forces a scope–nuance trade-off, which explains convergence on “capture at full granularity, coarsen iteratively, reconstruct on demand via τ”. Table 1: Data and agent execution-trace systems as (α,C,τ)(α,C,τ). Type: D = data, T = trace. System Type α π; ρ (S) τ RAPTOR D 100-tok chunks UMAP+GMM; LLM summary (high) collapsed / top-down GraphRAG D entity/rel extract Leiden; community report (high) global / entity fan-out xMemory D episode boundaries guided split/merge; fact distill (mid) top-down + expand H-MEM D episode + profile LLM 4-level; domain labels (low) top-down FAISS SimpleMem D window + density online synthesis; consolidated (high) multi-view parallel Mastra’s OM D token threshold all-in-one; Reflector (high) full log PageIndex D structural parse doc structure; titles/summaries (mid) reasoning-based MemoBrain T episode → thought dep.-subgraph; Fold (high) / Flush (low) context projection StackPlanner T task stack entries segment; Condensation (high) / Pruning (low) stack top + retrieval AgeMem T turns → entries learned; SUMMARY (high) / FILTER (low) RETRIEVE + FILTER InfiAgent T artifacts → files state consolidation (mid) bounded g(ℱt,at−k:t−1)g(F_t,a_t-k:t-1) 4 Discussion and Future Work The (α,C,τ)(α,C,τ) decomposition provides a common language for comparing systems. Table 1 shows that data-memory and agent-trace systems converge on the same pipeline despite different motivations. The C–T coupling suggests that ρ and τ should be chosen jointly, and the Fano bound (Appendix B) quantifies how representative quality caps branching factor. Current grouping functions π mostly assume unweighted, deterministic edges, but real unit graphs are often weighted or uncertain. Local community mining and search methods for weighted (SIWOw (Zafarmand et al., 2023)) and uncertain (USIWO (Talebirad et al., 2023)) graphs motivate extending π to exploit edge strength/confidence and improve partition coherence (Appendix C). As agent execution traces grow, most prior steps become irrelevant to the current subtask, creating a similar context pressure that motivates hierarchical organization of stored knowledge. This convergence suggests that complex agentic tasks involve three coupled hierarchies: the data hierarchy ℋdataH_data (stored knowledge), the trace hierarchy ℋtraceH_trace (reasoning, interaction, and action traces), and the task hierarchy ℋtaskH_task (problem decomposition). Each node in ℋtaskH_task induces a traversal through ℋdataH_data and generates nodes in ℋtraceH_trace; the inter-hierarchy alignment constraints remain open. The central limitation is the assumption of a static hierarchy. In practice, hierarchies evolve via insertion/restructuring (xMemory’s split/merge, MemTree (Rezazadeh et al., 2025)), and retrieval can reshape memory by reinforcement or decay. Both dynamics break the Markov-chain argument of Section 2.2: query-conditioned ρ makes VℓV_ depend on Q, and stateful coarsening couples τ and C into a feedback loop. Formalizing this setting (likely via adaptive information theory or online learning) is the most pressing open problem. Empirical next steps are validating SθS_θ as a predictor of retrieval quality, testing C–T coupling with matched ρ/τρ/τ variants, and computing IB-optimal ρ. References T. Barnes (2026) Observational memory: 95% on longmemeval. Note: Mastra Research BlogObservational Memory (OM) External Links: Link Cited by: §3.1. H. Chen, R. Pasunuru, J. Weston, and A. Celikyilmaz (2023) Walking down the memory maze: beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029. Note: MemWalker; Meta AI Research Cited by: §2.4. D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, and J. Larson (2024) From local to global: a graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130. Cited by: §1. C. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg (2024) RULER: what’s the real context size of your long-context language models?. In Conference on Language Modeling (COLM), Note: arXiv:2404.06654 Cited by: §1. Z. Hu, Q. Zhu, H. Yan, Y. He, and L. Gui (2026) Beyond RAG for agent memory: retrieval by decoupling and aggregation. arXiv preprint arXiv:2602.02007. Cited by: Appendix B, §1. J. Liu, Y. Su, P. Xia, S. Han, Z. Zheng, C. Xie, M. Ding, and H. Yao (2026) SimpleMem: efficient lifelong memory for LLM agents. arXiv preprint arXiv:2601.02553. Cited by: §1. N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024) Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, p. 157–173. Note: arXiv:2307.03172 Cited by: §1. H. Qian, Z. Cao, and Z. Liu (2026) MemoBrain: executive memory as an agentic brain for reasoning. arXiv preprint arXiv:2601.08079. Cited by: §1. A. Rezazadeh, Z. Li, W. Wei, and Y. Bao (2025) From isolated conversations to hierarchical schemas: dynamic tree memory representation for LLMs. In International Conference on Learning Representations, Note: MemTree; arXiv:2410.14052 Cited by: §4. P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. D. Manning (2024) RAPTOR: recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations, Cited by: §1. T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths (2024) Cognitive architectures for language agents. Transactions on Machine Learning Research. Note: arXiv:2309.02427 Cited by: §1. H. Sun and S. Zeng (2025) Hierarchical memory for high-efficiency long-term reasoning in LLM agents. arXiv preprint arXiv:2507.22925. Cited by: §1. Y. Talebirad, M. Zafarmand, O. R. Zaiane, and C. Largeron (2023) USIWO: a local community search algorithm for uncertain graphs. In Proceedings of the International Conference on Advances in Social Networks Analysis and Mining, p. 187–194. Cited by: §4. N. Tishby, F. C. Pereira, and W. Bialek (1999) The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, Cited by: footnote 3. C. Yu, Y. Wang, S. Wang, H. Yang, and M. Li (2026a) InfiAgent: an infinite-horizon framework for general-purpose autonomous agents. arXiv preprint arXiv:2601.03204. Cited by: §1, §3.1. Y. Yu, L. Yao, Y. Xie, Q. Tan, J. Feng, Y. Li, and L. Wu (2026b) Agentic memory: learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885. Note: AgeMem Cited by: §3.1. M. Zafarmand, Y. Talebirad, E. Austin, C. Largeron, and O. R. Zaïane (2023) Fast local community discovery relying on the strength of links. Social Network Analysis and Mining 13 (1), p. 112. Cited by: Appendix C, §4. M. Zhang, Y. Tang, and P. Team (2025a) PageIndex: next-generation vectorless, reasoning-based rag. PageIndex Blog. Note: https://pageindex.ai/blog/pageindex-intro Cited by: §2.4. R. Zhang, X. Jiang, Z. Yang, Z. Zhang, J. Gao, Y. Xiao, H. Lai, X. Chu, J. Zhao, and Y. Wang (2026) StackPlanner: a centralized hierarchical multi-agent system with task-experience memory management. arXiv preprint arXiv:2601.05890. Cited by: §1. Z. Zhang, Q. Dai, X. Chen, R. Li, Z. Li, and Z. Dong (2025b) MemEngine: a unified and modular library for developing advanced memory of LLM-based agents. arXiv preprint arXiv:2505.02099. Cited by: §1. Appendix A Alternate Self-Sufficiency Measures To make the C–T coupling testable in practice, we need proxies for self-sufficiency that can be estimated on real systems. The measures below separate what can be assessed at construction time from what is revealed only under concrete queries. Computable proxy (SθS_θ). Shannon entropy is not directly computable for natural language, so we use an LLM PθP_θ as a practical proxy: Sθ(ρ,Gj)= 1−logPθ(Gj∣ρ(Gj))−logPθ(Gj).S_θ(ρ,G_j)\;=\;1- - P_θ(G_j ρ(G_j))- P_θ(G_j). (2) Intuitively, this measures how much the representative helps the model predict the group’s contents. Since many systems use LLMs to generate representatives, SθS_θ directly affects downstream quality. If ρ hallucinates (adds facts not supported by the children), then Pθ(Gj∣ρ)P_θ(G_j ρ) drops: the representative no longer predicts the actual children and may actively mislead. The same drop occurs if ρ blurs important distinctions (e.g., by merging dissimilar items into a vague summary), because the representative fails to disambiguate. These failure modes suggest that SθS_θ is a reasonable signal of the reliability of ρ. The decision rule linking S to traversal choice (Section 2.3) holds only when SθS_θ indicates that ρ is reliable. When SθS_θ is low (due to aggressive compression, hallucination, or incoherent grouping), the system should default to treating ρ as referential, regardless of the nominal compression ratio. Query-dependent refinement (SSQS_Q). Both S and SθS_θ are query-independent: they measure the fraction of a group’s total information preserved by ρ. A query Q may need only a small portion of GjG_j’s information. The query-dependent self-sufficiency SSQ(ρ,Gj,Q)=I(Q;ρ(Gj))I(Q;Gj)S_Q(ρ,G_j,Q)\;=\; I(Q;\,ρ(G_j))I(Q;\,G_j) (3) measures the fraction of query-relevant information preserved by ρ(Gj)ρ(G_j) when I(Q;Gj)>0I(Q;G_j)>0 (if I(Q;Gj)=0I(Q;G_j)=0, define SSQ=1S_Q=1). Since ρ(Gj)ρ(G_j) is a deterministic function of GjG_j, the chain Q–GjG_j–ρ(Gj)ρ(G_j) is Markov and the Data Processing Inequality gives SSQ∈[0,1]S_Q∈[0,1]. Define query relevance rQ=I(Q;Gj)/H(Gj)r_Q=I(Q;G_j)/H(G_j). From I(Q;Gj∣ρ(Gj))≤H(Gj∣ρ(Gj))I(Q;G_j ρ(G_j))≤ H(G_j ρ(G_j)) and the identity SSQ=1−I(Q;Gj∣ρ(Gj))/I(Q;Gj)S_Q=1-I(Q;G_j ρ(G_j))\,/\,I(Q;G_j): SSQ≥ 1−1−SSrQ.S_Q\;≥\;1- 1-Sr_Q. (4) When rQr_Q is large, S is a tight lower bound on SSQS_Q. When rQr_Q is small, the bound does not provide a meaningful constraint, and low-S representatives can still achieve high SSQS_Q. This explains why referential representatives (category labels, hash keys, B-tree boundary keys) are effective for routing: the routing query requires very little of the group’s total information (rQ≈0r_Q≈ 0), so even S≈0S≈ 0 is compatible with SSQ≈1S_Q≈ 1. S is the appropriate construction-time metric, since no query is available when ρ is built. At design time, S determines the default traversal strategy via the C–T coupling (Section 2.3). At query time, SSQS_Q determines whether the representative suffices or the children must be expanded. Operationally, S is a prior over expected query-time behavior and should be calibrated on a target workload by comparing S against empirical means of SSQS_Q. Accordingly, the C–T coupling is a practical default: high S suggests more queries are answerable from the representative (collapsed search), while low S suggests more queries require expansion (top-down refinement). Appendix B Fano Bound and Branching Factor The Fano inequality constrains how many groups a level can contain given the discriminative quality of the representative. Theorem (Fano). Let Z∈1,…,nkZ∈\1,…,n_k\ be the correct group index and O the observable routing evidence. For any estimator Z^=g(O) Z=g(O) with error probability pe=P[Z^≠Z]p_e=P[ Z≠ Z]: pe≥1−I(Z;O)+1log2nk.p_e≥ 1- I(Z;O)+1 _2n_k. Corollary (from xMemory (Hu et al., 2026)). If I(Z;O)≤BI(Z;O)≤ B bits and pe≤εp_e≤ , then nk≤2(B+1)/(1−ε)n_k≤ 2^(B+1)/(1- ). In our framework, B is determined by the information content of ρ: low-S representatives (small B) force coarser partitions, while high-S representatives permit finer ones. Composing across L levels with per-level error εℓ _ , the total routing failure probability is bounded by ∑ℓεℓ _ _ (union bound). Optimal branching. Assuming a balanced tree, uniform per-node cost, and no caching, top-down retrieval with beam width k and branching factor b has total cost L⋅k⋅bL· k· b. With n=|V0|=bLn=|V_0|=b^L, substituting L=lnn/lnbL= n/ b yields cost ∝b/lnb b/ b, which is minimized at b=eb=e. The Fano constraint caps b at 2(B+1)/(1−ε)2^(B+1)/(1- ); if this is below e, the cap binds. The optimal depth is L∗=⌈lnn/lnb∗⌉L^*= n/ b^* . Depth bound. A related structural constraint governs hierarchy depth. If total data comprises N tokens, the context window holds C tokens, and each coarsening level compresses content by a factor r, the top level fits in context only when L≥⌈logr(N/C)⌉L≥ _r(N/C) . Appendix C Affinity and Coherence Traversal quality depends not only on how informative representatives are, but also on whether grouping preserves meaningful neighborhood structure. Here, we make the hidden assumption behind top-down pruning more explicit: grouping must preserve local relevance geometry well enough that parent-level decisions remain predictive at child level. The C–T coupling (Section 2.3) concerns the representative ρ; the grouping π also constrains traversal. Every hierarchical system implicitly assumes that units placed in the same group are in some sense alike, whether by embedding similarity (RAPTOR), graph connectivity (GraphRAG), or structural path (PageIndex). Without a principled notion of “which units belong together,” the partition is arbitrary and top-down pruning has no guarantee: a low-relevance parent might have highly relevant children in another topic that happened to be grouped with it. We formalize the notion of “alike” as an affinity and require that the partition respect it; that is what makes top-down traversal safe. The definitions below are an attempt to make this notion more precise. Definition 1 (Affinity) An affinity on a unit graph G=(U,E)G=(U,E) is a symmetric function W:U×U→ℝ≥0W:U× U _≥ 0. Definition 2 (W-coherent partition) A partition π:U↠[m]π:U [m] is W-coherent if within-group affinity exceeds between-group affinity: u,v:π(u)=π(v)[W(u,v)]>u,v:π(u)≠π(v)[W(u,v)].E_u,v:π(u)=π(v)[W(u,v)]>E_u,v:π(u)≠π(v)[W(u,v)]. Systems derive W from different signals: System Affinity W(u,v)W(u,v) Partition method RAPTOR cos(ϕemb(u),ϕemb(v)) ( _emb(u), _emb(v)) UMAP + GMM GraphRAG connectivity in E Leiden community detection H-MEM LLM-judged domain co-membership LLM 4-level classification xMemory cosine + sparsity-semantic score guided split/merge SimpleMem semantic + temporal proximity online LLM synthesis Mastra’s OM temporal contiguity token-count threshold PageIndex |lcp(ϕpath(u),ϕpath(v))||lcp( _path(u), _path(v))| structural parsing In Table 1, Leiden community mining (used in GraphRAG) is the clearest example of global graph partitioning. The graph-mining literature also studies community search, which identifies a local dense subgraph around given query nodes without processing the full graph (Zafarmand et al., 2023). This suggests a local variant of π: rather than committing to a single global partition at construction time, the system could perform ad-hoc, on-demand coarsening around the regions of the unit graph most relevant to the current query. This way, the quality of π depends directly on the fidelity of the graph signals used by the mining step. Furthermore, real-world unit graphs frequently carry edge weights or probabilistic edges encoding uncertainty in the underlying facts, as commonly arises in knowledge-graph predicates with uncertain truth values. Incorporating such richer signals into W can tighten the coherence gap γℓ=[Wwithin]−[Wbetween] _ =E[W_within]-E[W_between] and improve the quality of π. The affinity W is why π should group certain units together, but π need not compute W explicitly. In structural parsing, π reads off path prefixes directly. For agent execution traces, a symmetric pairwise affinity is often a poor fit: MemoBrain groups reasoning thoughts by causal co-resolution (thoughts connected via directed dependencies that jointly resolve a subproblem), and StackPlanner groups stack entries by contiguous stack membership. In both cases, the grouping criterion is structural and directed rather than similarity-based, so we omit these systems from the table. The underlying principle still holds, however: W-coherence is what makes top-down traversal efficient, and its analogue in trace systems is that steps addressing the same subproblem should cluster, ensuring that parent relevance predicts child relevance. The key intuition is relevance monotonicity: if a parent node has low relevance to the query, its children are unlikely to have high relevance, with the gap controlled by the coherence gap γℓ _ at that level. This justifies top-down pruning in H-MEM, xMemory, and MemWalker, and the same logic applies to MemoBrain’s Fold and Flush: if a summarized sub-trajectory is irrelevant to the current reasoning state, its constituent steps are also likely irrelevant. Systems with poor coherence (e.g., noisy LLM partitions) should use wider beams or prefer collapsed search. A formal characterization of how γℓ _ bounds child relevance conditioned on parent relevance remains open. Appendix D Traversal Algorithms The algorithms below operationalize the same design trade-off discussed in the main text: spend budget on broad routing first or on direct content evidence. Algorithm 1: Top-down refinement (H-MEM, xMemory). Given ℋH, query q, budget B, beam widths kL,…,k0k_L,…,k_0: set SL←Top-kL(VL,r(q,⋅))S_L -k_L(V_L,r(q,·)). For ℓ=L−1,…,0 =L-1,…,0: expand candidates←⋃v∈Sℓ+1children(v)candidates← _v∈ S_ +1children(v), then Sℓ←Top-kℓ(candidates,r(q,⋅))S_ -k_ (candidates,r(q,·)). Return S0S_0 truncated to budget B. Algorithm 2: Collapsed search (RAPTOR). Pool all nodes across all levels: pool←⋃ℓ=0LVℓpool← _ =0^LV_ . Retrieve S←Top-k(pool,r(q,⋅))S -k(pool,r(q,·)). Expand any non-leaf in S to its atoms. Return atoms within budget B. Algorithm 3: Reasoning-based navigation (PageIndex, MemWalker). Initialize context←ϕc(root(ℋ))context← _c(root(H)). While budget is not exhausted: the LLM selects a child node from context given q. If the node is a leaf, collect it. Otherwise, expand: context←context∪ϕc(children(node))context ∪ _c(children(node)). Return collected atoms. Algorithm 4: Multi-view parallel retrieval (SimpleMem). Decompose the query: (qsem,qlex,qsym,d)←Plan(q)(q_sem,q_lex,q_sym,d) (q). Set candidate limit n←f(d)n← f(d). Retrieve in parallel: Rsem←Top-n(V0,rsem(qsem,⋅))R_sem -n(V_0,r_sem(q_sem,·)), Rlex←Top-n(V0,rlex(qlex,⋅))R_lex -n(V_0,r_lex(q_lex,·)), Rsym←Top-n(V0,rsym(qsym,⋅))R_sym -n(V_0,r_sym(q_sym,·)). Return Rsem∪Rlex∪RsymR_sem∪ R_lex∪ R_sym truncated to budget B. Algorithm 4 uses a flat multi-view design: each view indexes the same atoms V0V_0 using a different signal (semantic similarity, lexical overlap, symbolic metadata), and taking the union ensures that a relevant atom missed by one view can still be recovered by another. More generally, a system can maintain multiple hierarchies ℋi\H_i\ over the same atom set V0V_0, each organizing atoms according to a different principle (e.g., one by temporal–thematic coherence, another by embedding similarity). This can be beneficial because any single grouping π is inevitably lossy: semantically related atoms may land in different branches, and the severity grows with partition granularity. A second hierarchy provides an alternative traversal path: an atom mis-partitioned in ℋ1H_1 may be correctly grouped in ℋ2H_2, so retrieval failure requires mis-partition in both views rather than just one. Multi-attribute queries benefit further because constraints of different types (e.g., topical relevance vs. temporal range) decompose naturally into per-hierarchy retrievals, which can be combined via set intersection for precision or union for recall; when constraints are independent, each retrieval can be parallelized. Appendix E Task and Agent Hierarchies In Section 4, we mention three coupled hierarchies and note that ℋtaskH_task has dynamics different from data memory. A useful extension is to separate two orthogonal structures: the task hierarchy (what to solve) and the agent hierarchy (which solver tier handles each part). Define task=(Vtask,Edep)G_task=(V_task,E_dep) as a directed acyclic graph, which is gradually constructed during execution to represent dependencies among tasks; its hierarchical, coarse-to-fine ordering is denoted as ℋtaskH_task. Let (A,≤A)(A, _A) represent a set of agent tiers, partially ordered by capability (and usually by cost as well). In this setup, taskG_task expresses the structure of the problem by specifying which subtasks depend on others, while (A,≤A)(A, _A) characterizes the solvers—showing which agent tiers are equipped to handle tasks of increasing difficulty and generality. Assignment of tasks to agents can then be handled by a routing map R:Vtask→AR:V_task→ A. To ensure coherence, a natural constraint is monotonicity: if t is an ancestor of t′t in taskG_task, then R(t)≥AR(t′)R(t) _AR(t ). This means that higher-level planning is given to more capable (typically more expensive) agents, while specific, lower-level execution is delegated to less capable tiers.