Paper deep dive
Governed Memory: A Production Architecture for Multi-Agent Workflows
Hamed Taheri
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/22/2026, 5:57:51 AM
Summary
Governed Memory is a four-layer infrastructure architecture designed to address the 'memory governance gap' in multi-agent enterprise AI systems. It provides a unified framework for shared memory, schema enforcement, tiered governance routing, and closed-loop quality feedback, enabling autonomous agents to operate with consistent organizational context, entity-scoped isolation, and reduced token redundancy.
Entities (6)
Relation Signals (5)
Governed Memory → includesmechanism → Dual Memory Model
confidence 100% · We present Governed Memory... through four mechanisms: a dual memory model...
Governed Memory → includesmechanism → Governance Routing
confidence 100% · We present Governed Memory... through four mechanisms: ... tiered governance routing...
Governed Memory → includesmechanism → Reflection-bounded retrieval
confidence 100% · We present Governed Memory... through four mechanisms: ... reflection-bounded retrieval...
Governed Memory → includesmechanism → Schema Lifecycle
confidence 100% · We present Governed Memory... through four mechanisms: ... a closed-loop schema lifecycle...
Governed Memory → performson → LoCoMo
confidence 95% · On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy
Cypher Suggestions (2)
Find all mechanisms associated with the Governed Memory architecture. · confidence 95% · unvalidated
MATCH (a:Architecture {name: 'Governed Memory'})-[:INCLUDES_MECHANISM]->(m:Mechanism) RETURN m.name
Identify benchmarks used to validate the architecture. · confidence 90% · unvalidated
MATCH (a:Architecture {name: 'Governed Memory'})-[:PERFORMS_ON]->(b:Benchmark) RETURN b.name
Abstract
Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open-set atomic facts with schema-enforced typed properties; tiered governance routing with progressive context delivery; reflection-bounded retrieval with entity-scoped isolation; and a closed-loop schema lifecycle with AI-assisted authoring and automated per-property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual-modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross-entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.
Links
- Source: https://arxiv.org/abs/2603.17787v1
- Canonical: https://arxiv.org/abs/2603.17787v1
Full Text
64,926 characters extracted from source content.
Governed Memory: A Production Architecture for Multi-Agent Workflows
Hamed Taheri
Personize.ai
arXiv:2603.17787v1 [cs.AI] 18 Mar 2026
Abstract
Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open-set atomic facts with schema-enforced typed properties; tiered governance routing with progressive context delivery; reflection-bounded retrieval with entity-scoped isolation; and a closed-loop schema lifecycle with AI-assisted authoring and automated per-property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual-modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross-entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.
1 Introduction
1.1 The Memory Governance Gap
Enterprise AI adoption does not produce a single agent. It produces dozens of autonomous agent nodes distributed across workflows, tools, and teams: enrichment pipelines, outbound sequences, support automation, scoring models, research agents, and operational automations. Each node reads or writes information about the same entities, the same customers, companies, and deals, yet these nodes share neither a common memory of the entities they act upon nor a common governance layer enforcing organizational policies, compliance rules, and quality standards. In this setting, retrieval quality is necessary but insufficient. The organization faces five structural challenges that no single-agent memory system addresses:
• Memory silos across agent workflows. The enrichment agent discovers a CTO is evaluating three vendors. The outbound sequence agent, executing hours later, sends a generic cold email. The support agent resolves a critical pain point. Months later, the renewal agent resurfaces it as a selling feature. Each workflow node acts on the same entities but shares no context with the others. Organizational intelligence accumulates nowhere.
• Governance fragmentation across teams and tools. Sales builds AI outreach with one system prompt embedding brand voice. Support runs a bot with compliance rules copied from a Notion doc last quarter. Marketing uses a separate workflow with its own tone guidelines. When legal updates the data handling policy, no mechanism propagates it to the 14 agent configurations across three teams. There is no versioning, no single source of truth, and no way to ensure all agents operate under the same organizational rules.
• Unstructured memory as a downstream dead end. Free-text memories can be retrieved by similarity and pasted into a prompt. Beyond that, they are terminal. They cannot be filtered by buying stage, ranked by deal value, routed to conditional workflows, synchronized to a CRM, or aggregated across thousands of entities.
Without schema-enforced typed properties, memory is useful for prompt augmentation but unusable by any downstream system that requires structured, queryable data.
• Context redundancy in autonomous multi-step execution. Modern agents operate in autonomous loops (planning, acting, observing, re-planning) without human intervention between steps. Each step may invoke governance routing independently. Without session awareness, the same compliance policy is re-injected into every step, consuming context window capacity that should be reserved for task-specific reasoning and degrading model attention on fresh instructions [4].
• Silent quality degradation without operational feedback. Schemas age. Models get updated. Content types shift. New agent workflows produce data the schema was not designed for. No per-property accuracy monitoring exists. No extraction confidence is tracked over time. No schema drift is detected. The organization discovers the problem when a CRM field has been wrong for three months or a downstream pipeline quietly stops producing useful output.
We term this the memory governance gap: the absence of an infrastructure layer governing what agents store, how stored information is typed and queried, which organizational policies reach which agent, how context is delivered across autonomous execution steps, and whether the system is performing reliably.
1.2 Why RAG Is Not Enough
RAG [1, 6] established a foundational paradigm: ground model outputs in retrieved evidence. But RAG is a retrieval primitive, not an infrastructure layer. It addresses a single concern, retrieval relevance, and assumes a single agent, a single query, and a static document store.
It provides no mechanism for governing what agents write into the store, no organizational context routing based on task requirements, no session-aware delivery across autonomous execution steps [4], no schema enforcement for downstream consumption, and no quality feedback loop for detecting degradation at scale. Governed Memory addresses the layer RAG leaves vacant.
1.3 Contributions
This paper makes four contributions, each addressing one or more of the five challenges above:
1. A dual memory taxonomy with formal quality gates (addresses memory silos and the downstream dead-end problem). We distinguish open-set memory (coreference-resolved atomic facts stored as vector embeddings) from schema-enforced memory (typed property values governed by organizational schemas with confidence scores), processed in a single extraction pass with automated quality gates. The shared store enables any agent across the organization to read and write entity memory through a common interface.
2. Tiered governance routing with progressive context delivery (addresses governance fragmentation and context redundancy). A mechanism for selecting which organizational context should be injected into an agent’s context window, supporting a fast governance-aware hybrid path (∼850ms average, E3) and a full two-stage LLM selection path (∼2–5s), with session-aware delta delivery that tracks previously injected context across autonomous multi-step executions.
3. Reflection-bounded retrieval with entity-scoped isolation (addresses memory silos). An iterative protocol checking evidence completeness and generating targeted follow-up queries within bounded rounds, combined with CRM-key-based entity scoping that enforces hard isolation across tenants and entities.
4. Schema lifecycle management with closed-loop self-evaluation (addresses the downstream dead-end and silent quality degradation).
A lifecycle spanning AI-assisted schema authoring, interactive enhancement, criteria-based rubric scoring with execution logging, and automated per-property schema refinement.
2 Related Work
2.1 Retrieval-Augmented Generation and Iterative Retrieval
RAG [1] and its extensions (scale [5], self-reflection [3], iterative correction [11], and structure-preserving hierarchical selection [17, 18, 19]) address retrieval quality but leave four gaps: no governance over what is stored, no organizational context routing, no session-aware delivery, and no quality feedback loop. Document expansion by query prediction [23] improves retrieval by appending LLM-generated queries to documents at index time; HyDE [24] takes the inverse approach, generating hypothetical documents for a given query. Our governance routing layer adapts the document-expansion strategy to organizational context (Section 5), but the core contribution of this work is the governance architecture itself rather than novel retrieval primitives.
2.2 Agent and Production Memory Systems
Single-agent memory systems [7, 2, 8, 9, 10, 22] model memory for individual agents in controlled environments; they do not address organizational context, schema enforcement, or multi-tenant isolation. Production memory layers, SimpleMem [15] and Mem0 [16], formalize write-time atomization into coreference-resolved facts with semantic deduplication. These are memory primitives: they address how individual facts are stored and retrieved. Governed Memory operates at the infrastructure layer above, extending any memory substrate with schema enforcement, organizational governance routing, entity-scoped multi-tenant isolation, and closed-loop quality feedback. The relationship is architectural, not competitive; the governed layer requires capable memory primitives underneath, and the contributions of this paper are orthogonal to improvements in atomic fact extraction.
2.3 Evaluation and Context Delivery
Domain-specific rubric scoring with execution trace capture extends LLM-as-judge [12] and reference-free evaluation [13] to the governance setting. Liu et al. [4] demonstrated that models under-use information in long-context middles, motivating session-aware progressive delivery that injects only delta content on each step.
2.4 Positioning Summary
Prior systems address individual memory capabilities in isolation: SimpleMem and Mem0 provide memory primitives (atomic fact extraction, deduplication); Self-RAG introduces reflective retrieval; MemGPT explores tiered memory management. These operate at the level of a single agent’s memory. No prior system provides the shared infrastructure layer above, combining schema-enforced typed memory, organizational governance routing, progressive context delivery, closed-loop schema refinement, and multi-tenant entity isolation into a unified architecture accessible to any agent across an organization. Governed Memory addresses this gap.
3 Architecture Overview
Governed Memory is organized as a four-layer architecture (Figure 1). Each layer addresses a distinct governance concern and can be independently configured or disabled.
Layer 1: Dual Memory Store. Content is processed by a dual extraction pipeline producing open-set memories (atomic, self-contained facts) and schema-enforced memories (typed property values governed by organizational schemas), stored in a unified memory entry format with entity scoping and organizational isolation (Section 4).
Layer 2: Governance Routing. A tiered router selects which organizational context (policies, guidelines, templates) should be injected for a specific task. A fast governance-aware hybrid path (∼850ms average, E3) ranks candidates without LLM calls; a full two-stage path uses LLM structured analysis (∼2–5s). A session layer delivers only delta content (Section 5).
Layer 3: Governed Retrieval. Vector similarity search with entity-scoped filtering and a reflection loop checking evidence completeness within bounded rounds (Section 6).
Layer 4: Schema Lifecycle and Quality Feedback. AI-assisted schema authoring, domain-specific rubric evaluation with execution logging, and automated per-property refinement closing the feedback loop (Section 7).
All memory entries share a unified record structure carrying entity scope, organizational partition, provenance metadata (content hash, extraction method, model identifier, chunk position, redaction status), and type-specific fields for schema-enforced properties. All operations are partitioned by organization ID, providing hard tenant isolation. Within an organization, retrieval can be further scoped to a specific entity using CRM keys, preventing cross-entity contamination. The system enforces a two-phase content redaction pipeline scrubbing PII and secrets before and after LLM extraction. Full data model, entity scoping, security posture, and redaction details are provided in Sections A and B. The system exposes a standard MCP interface and SDK, enabling any compatible agent, regardless of framework or vendor, to read, write, and govern memory through the same organizational context without bespoke integration.
[Figure 1. Governed Memory four-layer architecture. Agent nodes interact through a shared, organization-scoped API surface. Layers are independently configurable. Panels: Layer 1, Dual Memory Store (open-set memories; schema-enforced properties; dual extraction in a single LLM pass; entity scoping by orgId + CRM key). Layer 2, Governance Routing (fast mode, ~850ms, no LLM call; full mode, ~2-5s, LLM analysis; progressive delivery; governance variables). Layer 3, Governed Retrieval (query embed; entity filter; reflection loop, ≤2 rounds; merge and dedup; LLM synthesis with source attribution). Layer 4, Schema Lifecycle & Quality Feedback (AI schema authoring; rubric scoring /100; execution logging; auto refinement).]
4 Dual Memory Model
4.1 Problem Statement
Enterprise agents must store both unstructured insights and structured property values. A sales call transcript contains the free-form insight “The CTO mentioned they are evaluating three vendors” alongside the structured fact that the deal value is $450,000. Existing memory systems support one modality or the other. The dual memory model stores both simultaneously from the same extraction pass, ensuring no information is lost to modality mismatch.
4.2 Open-Set Memory
Open-set memories are atomic, self-contained facts extracted from unstructured content. The extraction prompt enforces five invariants: completeness, self-containment, coreference resolution, temporal anchoring, and atomicity. Three lightweight quality gates are computed per extraction batch: coreference score (pronoun detection), self-containment score (syntactic pattern matching), and temporal anchoring score (relative-time pattern detection). These serve as early-warning operational signals. Open-set memories are embedded and stored in a vector database with per-organization partitioning. Before insertion, each candidate is compared against existing entries; candidates exceeding a cosine similarity threshold (default: 0.92) are skipped, preventing near-duplicate accumulation.
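The write-side deduplication check described above can be sketched as follows. This is a minimal illustration, not the production implementation: the in-memory list standing in for the vector store, the function names, and the choice of `>=` at the boundary (taken from the "cosine ≥ 0.92" label in Figure 2) are all assumptions.

```python
import math

DEDUP_THRESHOLD = 0.92  # default write-side threshold stated in the text


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def write_with_dedup(store, candidates, threshold=DEDUP_THRESHOLD):
    """Append candidates to `store` unless a stored entry is too similar.

    `store` is a hypothetical in-memory list of (text, vector) pairs that
    stands in for the per-organization vector partition.
    Returns the texts that were skipped as near-duplicates.
    """
    skipped = []
    for text, vector in candidates:
        is_duplicate = any(
            cosine_similarity(vector, existing_vector) >= threshold
            for _, existing_vector in store
        )
        if is_duplicate:
            skipped.append(text)
        else:
            store.append((text, vector))
    return skipped
```

Because each accepted candidate is appended before the next one is checked, near-duplicates within the same batch are also skipped. A real deployment would replace the linear scan with a nearest-neighbor query against the vector index.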
4.3 Schema-Enforced Memory
Schema-enforced memories are typed property values extracted according to an organizational schema defining properties with names, descriptions, types (text, number, date, boolean, options, array), and extraction hints.
Property selection. Before extraction, the system selects relevant properties using embedding similarity between content and property metadata, with a minimum score threshold and maximum count cap. This prevents presenting the LLM with hundreds of irrelevant properties, reducing hallucination and improving type compliance.
Dual extraction. A single LLM call receives both content and selected property definitions, producing two parallel outputs: (1) typed property values with confidence scores and update semantics, and (2) open-set atomic facts. This ensures the same content is processed once. Extracted values are validated against schema-declared types. Values carry explicit update flags supporting both single-value replacement and temporal accumulation. Extraction confidence scores follow the observation that language models can calibrate their own certainty [14].
4.4 Algorithm: Dual Extraction Pipeline
The extraction pipeline proceeds as: (1) optional pre-extraction PII redaction; (2) content chunking with overlap; (3) per-chunk dual extraction with property selection; (4) post-extraction redaction scan; (5) cross-chunk deduplication (highest-confidence properties, normalized-text facts); (6) quality gate computation and logging; (7) embedding generation; (8) provenance attachment; and (9) write-side deduplication against the existing store.
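Step (5) of the pipeline above, cross-chunk deduplication, can be sketched as below. The text only says that the highest-confidence property wins and that facts are compared on normalized text; the `PropertyValue` type, the normalization rule (lowercase, strip punctuation, collapse whitespace), and the function names are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class PropertyValue:
    """Hypothetical container for one extracted schema-enforced value."""
    name: str
    value: object
    confidence: float


def normalize_fact(text):
    """Assumed normalization: lowercase, drop punctuation, collapse spaces."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def dedup_across_chunks(chunk_results):
    """Merge per-chunk extraction outputs.

    chunk_results: list of (properties, facts) tuples, one per chunk.
    Returns (best_properties_by_name, unique_facts_in_first_seen_order).
    """
    best = {}
    seen = set()
    facts = []
    for properties, chunk_facts in chunk_results:
        for prop in properties:
            current = best.get(prop.name)
            # Keep the highest-confidence value for each property name.
            if current is None or prop.confidence > current.confidence:
                best[prop.name] = prop
        for fact in chunk_facts:
            key = normalize_fact(fact)
            if key not in seen:
                seen.add(key)
                facts.append(fact)
    return best, facts
```

Overlapping chunk windows make this step necessary: the same sentence can fall into two adjacent chunks and be extracted twice, sometimes with different confidence.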
Algorithm description is provided in Section C (Algorithm 2).
[Figure 2. Dual extraction pipeline. A single LLM call produces both open-set facts and schema-enforced typed properties, followed by quality gates and write-side deduplication. Stages shown: (1) raw content (calls, emails, docs); (2) PII redaction (pre-extraction scan); (3) content chunking (overlap windows); (4) property selection (embedding similarity, top-k properties); (5) dual extraction (single LLM call, both modalities); (6) post-redaction (second PII scan); (7) cross-chunk dedup (highest confidence wins); (8) quality gates (coreference, self-containment, time); (9) Embed & Prove (text-embedding-3-small); (10) write dedup (cosine ≥ 0.92 → skip). Memory store: LanceDB + DynamoDB, with background consolidation (0.95).]
A background consolidation process (Section D, Algorithm 3) periodically merges near-duplicate memories and prunes stale entries, using a deliberately higher similarity threshold (0.95) than write-side deduplication to minimize false merges.
5 Governance Routing
5.1 Governance Variable Model
Organizational context is stored as governance variables (also called guidelines), with structured metadata including name, description, tags, content, heading hierarchy, and content-aware embeddings.
When a variable is created or updated, three enrichment steps run in parallel: (1) Hypothetical Prompt Enrichment (HyPE), generating synthetic queries representing plausible agent requests, adapting the document-expansion-by-query-prediction approach [23] (which appends predicted queries to documents before indexing) to the governance setting, with naming inspired by HyDE [24] (which instead generates hypothetical documents for a given query); (2) governance scope inference, in which an LLM determines whether the variable is always-on and infers trigger keywords; and (3) content-aware embedding computed from metadata and content preview.
5.2 Tiered Routing Modes
Fast mode (∼850ms average, E3). No LLM call. Each candidate is scored using a weighted composite of embedding similarity and keyword overlap against variable metadata and HyPE-generated queries, plus a governance scope boost for always-on variables. Results are partitioned into critical and supplementary sets with dynamic caps. Algorithm description is provided in Section C.
Full mode (∼2–5s). Two-stage pipeline: embedding pre-filter reducing candidates, followed by LLM multi-step structured analysis classifying context as critical or supplementary with section-level extraction capability.
Auto mode (default). Selects fast or full based on library characteristics.
[Figure 3. Governance routing with tiered modes (fast/full) and progressive context delivery. The session layer tracks delivered variables to inject only delta content on each autonomous step. Figure annotations: step 1 delivers the full governance set and records what was delivered; steps 2+ deliver only new or changed content; supplementary items are not recorded and can be promoted later. Reported in the figure: 92% routing precision, 88% recall, 20–50p discovery improvement from AI-assisted authoring, 50% token reduction via progressive delivery.]
5.3 Progressive Context Delivery
Modern agents increasingly operate in autonomous multi-step loops, planning, acting, observing, re-planning, and acting again, without human intervention between steps. In such workflows, an agent may invoke governance routing multiple times within a single session as its task evolves and new context becomes relevant.
Simply re-injecting the full governance set on every step creates three compounding problems: (1) token bloat, as context windows fill with previously delivered material rather than new task-relevant guidance; (2) accuracy degradation, as redundant context competes with fresh instructions for the model’s attention [4]; and (3) unnecessary cost, since each duplicated variable consumes tokens billed per step. Progressive delivery addresses this by maintaining a session state record that tracks which variables (and which sections within them) have already been delivered. On each routing call, already-delivered variables are excluded; only new or newly relevant content is resolved and injected. Supplementary items are intentionally not recorded, allowing promotion on subsequent calls if the task evolves, ensuring that a guideline initially deemed peripheral can surface as critical when the agent’s plan shifts. Algorithm description is provided in Section C (Algorithm 5).
6 Reflection-Bounded Retrieval
6.1 Retrieval Architecture
The retrieval path proceeds as: (1) query embedding; (2) vector search within the organization partition with entity-scoped CRM key filters; (3) post-filtering by metadata (persons, entities, location, timestamp range, memory type); (4) optional reflection loop; (5) merge and deduplication; and (6) optional LLM answer synthesis with source attribution.
6.2 Reflection Loop
The reflection loop is bounded by a configurable maximum round count (default: 2). Each round: an LLM judges evidence completeness at low temperature (0.1), and if incomplete, generates one to two targeted follow-up queries at moderate temperature (0.3). Results are merged by identifier across rounds. Each round adds predictable latency: one LLM call plus zero to two embedding-and-search operations. Algorithm description is provided in Section C (Algorithm 6).
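The bounded reflection loop described above can be sketched as follows. This is an illustration of the control flow only: `search`, `judge_complete`, and `propose_followups` are hypothetical callables standing in for the entity-scoped vector search and the LLM calls, and splitting the judgment and follow-up generation into two callables is an assumption (the text suggests a single LLM call per round).

```python
def reflective_retrieve(query, search, judge_complete, propose_followups,
                        max_rounds=2, max_followups=2):
    """Retrieve evidence, then refine it within a bounded number of rounds.

    search(q)                  -> list of hits, each a dict with an "id" key
    judge_complete(q, hits)    -> True when the evidence answers the query
    propose_followups(q, hits) -> list of follow-up query strings
    All three are hypothetical stand-ins supplied by the caller.
    """
    results = {}
    for hit in search(query):
        results[hit["id"]] = hit

    for _ in range(max_rounds):
        evidence = list(results.values())
        if judge_complete(query, evidence):
            break  # evidence judged complete: exit early
        # At most `max_followups` targeted queries per round.
        for follow_up in propose_followups(query, evidence)[:max_followups]:
            for hit in search(follow_up):
                # Merge by identifier so repeated hits are not duplicated.
                results.setdefault(hit["id"], hit)

    return list(results.values())
```

The hard cap on rounds is what keeps latency predictable: with the default of two rounds, the worst case is two judgment calls and four additional searches, regardless of how the judge behaves.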
[Figure 4. Reflection-bounded retrieval with entity-scoped isolation. The optional reflection loop generates targeted follow-up queries within bounded rounds; entity isolation is enforced by CRM key pre-filtering. Figure annotations: hybrid retrieval returns schema-enforced typed properties (priority) and open-set facts (by recency) as a token-budgeted entity context block; adversarial validation found 0 true leakage across 3,800 results from 500 queries, with a 2.74% flag rate, all false positives. Reported in the figure: +25.7p completeness (2 rounds), 62.8% on hard multi-hop, zero cross-entity leakage, most gains in the first additional round.]
6.3 Hybrid Retrieval
Retrieval operates across both memory types simultaneously, returning results in a unified format. A standalone entity context injection endpoint compiles per-entity data from both storage tiers into a token-budgeted context block. The endpoint resolves entity identity through CRM keys, fetches schema-enforced property values and open-set memories, and compiles them into a structured block with Properties and Observations sections.
Token budget enforcement prioritizes schema-enforced properties (more structured, actionable) over open-set memories (ordered by recency), enabling any downstream consumer to obtain entity context without invoking governance routing.
7 Schema Lifecycle and Self-Evaluation
This section describes the full schema lifecycle, from creation through interactive improvement to automated refinement, designed around a central principle: schemas are living documents that evolve through human guidance and AI-augmented feedback.
7.1 Schema Authoring and Interactive Enhancement
AI-assisted authoring. Users describe, in natural language, what structured information they need to extract from their data, e.g., “I want to capture each contact’s role, buying intent, and preferred communication channel.” An AI assistant within the web application translates this description into a complete schema definition, generating typed property specifications with names, descriptions, extraction instructions, and validation constraints. This streaming generation service accelerates the translation from user intent to well-formed schema without requiring expertise in schema design.
Interactive enhancement. Once a schema is in use, operators describe observed issues in natural language, e.g., “the role field is too vague” or “buying intent should distinguish between active evaluation and future interest”, and receive streaming revised property definitions, at per-property or bulk granularity.
7.2 Criteria-Based Rubric Scoring
Sections 7.2 to 7.4 are currently available via API to our internal team and select early-access users for iterative accuracy and performance improvement.
Users can invoke the platform’s native agents or prompt endpoints via API, and then evaluate how governance context and retrieved memory influenced output quality. Each agent interaction is scored against domain-specific rubrics.
The system provides four presets (default, sales, support, research), each normalized to 100 points with weighted criteria. Organizations can define custom rubrics. Evaluation bias is mitigated through rubric-first prompting, trace-grounded evaluation (access to full execution trace including tool calls and memory operations), and configurable cross-model evaluation.
7.3 Execution Logging
Beyond scoring, evaluation captures a structured trace: conversation summary, tool usage log, memory recall log (with usage flags), memory creation log, and governance context log with helpfulness ratings. This enables diagnostic questions that scalar scores cannot answer: “Did the agent score low on Completeness because it failed to recall relevant memories, or because it recalled them but did not use them?”
7.4 Aggregation and Quality Feedback
Evaluation records enable low-score detection, per-criterion breakdown, trend analysis annotated with schema and model changes, and per-endpoint comparison across models. Low context utilization scores inform governance improvements; low recall completeness scores inform reflection configuration. Recurring diagnostic patterns are summarized in Section G, Table 8.
7.5 Automated Schema Refinement
Schema refinement is available to all users through the web application. When extraction quality is evaluated, the system executes a three-phase pipeline: (1) extraction replay producing baseline results; (2) per-property analysis classifying each property as extracted, missed, low-confidence, inaccurate, or unavailable, generating structured improvement instructions; (3) parallel per-property optimization producing revised definitions with change annotations. The three-phase design separates objective data, diagnostic judgment, and targeted fixes. Running Phase 3 in parallel keeps latency bounded. Algorithm description is provided in Section C (Algorithm 7) and a worked example in Section G.
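To make the five Phase 2 labels concrete, the sketch below classifies each property from a replayed extraction. The paper does not specify the classification criteria, so everything here is an assumption for illustration: the rule order, the 0.5 confidence cutoff, the use of a known reference value in place of the system's diagnostic judgment, and all function names.

```python
LOW_CONFIDENCE = 0.5  # hypothetical cutoff; the paper does not state one


def classify_property(extracted, reference, low_confidence=LOW_CONFIDENCE):
    """Return one of the five Phase 2 labels for a single property.

    extracted: (value, confidence) from the replayed extraction, or None.
    reference: the value the content actually supports, or None when the
               content does not contain the information at all.
    """
    if reference is None:
        return "unavailable"
    if extracted is None:
        return "missed"
    value, confidence = extracted
    if value != reference:
        return "inaccurate"
    if confidence < low_confidence:
        return "low-confidence"
    return "extracted"


def analyze_schema(replay, references):
    """Label every property; the keys of `references` define the schema."""
    return {
        name: classify_property(replay.get(name), reference)
        for name, reference in references.items()
    }


def properties_needing_refinement(labels):
    """Assumed Phase 3 input: properties whose definitions may need revision."""
    return sorted(
        name for name, label in labels.items()
        if label in {"missed", "inaccurate", "low-confidence"}
    )
```

Separating the labeling step from the selection step mirrors the three-phase split in the text: the replay supplies objective data, the labels carry the diagnosis, and only the flagged properties move on to per-property optimization.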
8 Experimental Evaluation

8.1 Evaluation as Operational Monitoring

These experiments serve a dual purpose: validating the core architectural mechanisms and defining a repeatable evaluation methodology that organizations can apply as ongoing operational monitoring, running continuously as schemas, content types, and underlying models evolve. Each experiment targets a specific governance concern and maps directly to a production health signal: extraction quality, routing precision, retrieval completeness, entity isolation, conflict detection, and end-to-end output quality. The metrics introduced here (governance routing precision and recall, schema discovery rate, context defect rate, and memory density curves) are proposed as standard instrumentation for production memory systems, not one-time benchmark scores.

Small controlled datasets are intentional: they are monitoring templates, designed for fast, interpretable re-execution, not underpowered one-shot studies. Sample sizes were validated by scaling: experiments began at N = 10 per content type and were incrementally increased to N = 50; key metrics (fact recall, routing precision, entity isolation) stabilized with minimal variance beyond N = 30 per type, confirming that N = 250 is sufficient for the effects being measured.

We designed experiments using synthetic datasets with embedded ground truth, enabling reproducible measurement with known fact counts, property values, coreference issues, and near-duplicates. All experiments use a fixed random seed (42) and were executed against the production API. The synthetic data are designed to stress distinct extraction challenges under controlled conditions.

Table 1. Experimental dataset summary.

| Dataset | Samples | Ground Truth Elements |
|---|---|---|
| Primary corpus | 250 samples, 5 types | 8–12 facts, 5–8 props/sample |
| Multi-source entity | 5 sources | 40 unique facts, 8 cross-source dupes |
| Entity isolation profiles | 100 entities | High overlap with unique markers |
| Recall query sets | 500 queries | Expected topics & min source counts |
| Governance variable pairs | 5 pairs | Known targets for 15 tasks |
| Conflict pairs | 30 pairs, 15 cats | Stale + fresh claims with known dates |

8.2 Extraction Quality Across Content Types (E1)

Fact recall is 98–100% for every content format (Table 2). We acknowledge that these near-perfect scores reflect evaluation on synthetic datasets that, while structurally diverse across five content types and designed to stress distinct extraction challenges (coreference, temporal reasoning, implicit facts), are free of the noise, formatting inconsistencies, and ambiguity typical of production data. The full evaluation datasets and ground-truth annotations are publicly available at https://github.com/personizeai/governed-memory. The results should therefore be interpreted as evidence that the extraction architecture and algorithm reliably capture ground-truth facts under controlled, diverse conditions, an upper-bound demonstration of the pipeline’s capability, rather than as a claim of identical performance on arbitrary real-world inputs. Production deployments exhibit comparable but modestly lower recall, consistent with the additional noise in organic content. The schema refinement mechanisms (Section 7) address property extraction variance, which is predominantly attributable to content structure and schema maturity rather than algorithmic limitations.

Table 2. Extraction quality by content type (250 samples).

| Content Type | Samples | Fact Recall |
|---|---|---|
| Call notes | 50 | 100% |
| Documents | 50 | 100% |
| Emails | 50 | 100% |
| Transcripts | 50 | 100% |
| Chats | 50 | 98% |
| Overall | 250 | 99.6% |
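As a consistency check on Table 2, the overall figure can be reproduced as a sample-weighted mean of the per-type recall values. The sketch below assumes this is how the overall row was aggregated, which the paper does not state explicitly; the `TABLE_2` dictionary and the `weighted_recall` function are illustrative names, not part of the system.

```python
# Per-type (samples, fact recall in %) as reported in Table 2.
TABLE_2 = {
    "Call notes": (50, 100.0),
    "Documents": (50, 100.0),
    "Emails": (50, 100.0),
    "Transcripts": (50, 100.0),
    "Chats": (50, 98.0),
}


def weighted_recall(rows):
    """Sample-weighted mean recall (assumed aggregation rule)."""
    total_samples = sum(n for n, _ in rows.values())
    return sum(n * recall for n, recall in rows.values()) / total_samples


overall = weighted_recall(TABLE_2)
print(round(overall, 1))  # 99.6
```

With equal sample counts per type, the weighted mean coincides with the simple mean of the five per-type values.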
8.3 Quality Gates Ablation (E9)

On 40 samples, retrieval with quality gates reduces the output defect rate by 25% in relative terms compared with raw retrieval (6.3% vs. 8.4%). Temporal accuracy improves by 6.8 percentage points (95.2% vs. 88.4%), and the signal-to-noise ratio (useful facts retrieved vs. noise) increases from 1.1:1 to 4.2:1. These gains arise because the write-time quality gate pipeline filters coreference-unresolved, non-self-contained, and temporally ambiguous facts before they enter the store, so downstream retrieval inherits a cleaner signal. Decision precision (94.5%) validates the gate as a reliable guard rather than a coarse filter.

8.4 Dual Memory Complementarity (E12)

Table 3. Coverage distribution across memory modalities (20 samples).

| Category | % of Total |
|---|---|
| Captured by both modalities | 34% |
| Open-set only (long-tail insights) | 38% |
| Schema-enforced only (typed values) | 12% |
| Missed by both | 16% |
| Combined recall | 82.8% |

The 38% captured exclusively by open-set memory (relational facts, qualitative observations, contextual details) would be permanently lost in a schema-only system. The 12% captured exclusively by schema enforcement would lack type enforcement in an open-set-only system. This validates the dual architecture as complementary rather than redundant.

Two unique advantages of schema-enforced properties are worth noting. First, the schema is not fixed: the schema refinement mechanisms (Section 7) allow operators to evolve property definitions over time, progressively promoting observed patterns into typed, queryable structure as the deployment matures. Second, and more consequential for production use: schema-enforced properties fulfill a role that open-set memories, which are stored as free-form text, cannot.
Structured properties are directly addressable by downstream query logic, filtering pipelines, and expert systems; they carry type guarantees and can drive conditional routing and decision logic without natural language parsing. A compliance constraint or communication preference that exists only as free-form memory is retrievable but not directly queryable; the same fact captured as a schema-enforced property becomes a first-class signal in any downstream system. The practical ceiling of the dual architecture is therefore higher than the coverage numbers alone suggest: the 38% open-set-only bucket preserves recall that would otherwise be lost, while the schema layer, continuously refinable, determines how much of that captured knowledge becomes structurally actionable.

8.5 Governance Routing Effectiveness (E3, E13)

Routing precision. Against 25 governance variables across 5 categories, routing achieves 92% precision (nearly all selected variables are relevant) and 88% recall across 20 diverse task types spanning sales, compliance, engineering, marketing, and support.

Authoring quality impact. Well-authored governance variables are 20–50 percentage points more discoverable than poorly-authored equivalents: in 3 of 5 categories (brand, product, support), poorly-authored variables scored 0% discovery rate. This validates the AI-assisted authoring tools (Section 5, Section 7) as operationally significant.

8.6 Reflection-Bounded Retrieval (E10)

Manual multi-hop retrieval with 2 rounds achieves 62.8% completeness versus 37.1% baseline, a 25.7 percentage point improvement on hard multi-faceted queries where information is scattered across 3–5 sources. API-managed reflection (+3.3p) shows more modest gains than manual multi-hop (+25.7p), indicating that query generation strategy is the key determinant of reflection effectiveness. This gap does not reflect a retrieval quality difference between the two paths; it reflects the investment in query strategy.
API-managed reflection applies generic follow-up queries; manual multi-hop allows developers to decompose complex questions into targeted retrieval passes tuned to their domain. The completeness ceiling on hard multi-faceted queries is therefore set by application-layer query design, not by the memory system itself. The single-round manual condition (61.2%) approaches the two-round ceiling, suggesting most completeness gains materialize in the first additional retrieval pass. Reflection is effective when the memory store contains relevant but scattered information; cases where underlying data is absent yield diminishing returns regardless of round count.

Table 4. Reflection ablation on 10 hard multi-hop queries (4 conditions).

| Condition | Avg Compl. | Avg Results | Avg Latency |
|---|---|---|---|
| No reflection (baseline) | 37.1% | 15.0 | 9.4s |
| API-managed, 1 round | 40.4% | 22.5 | 10.4s |
| Manual multi-hop, 1 round | 61.2% | 20.8 | 6.5s |
| Manual multi-hop, 2 rounds | 62.8% | 21.9 | 10.0s |

8.7 Entity Isolation (E11)

Under adversarial conditions (100 entities with the same industry, similar roles, overlapping names, and similar deal sizes), entity-scoped retrieval produces zero true cross-entity leakage across 3,800 results (500 queries across 5 query types). All 104 flagged results (a 2.74% flag rate) are false positives attributable to shared name tokens across distinct records, not actual memory bleed. Isolation is enforced by the CRM key pre-filtering mechanism, not embedding distinctiveness.

Table 5. Semantic conflict resolution (30 pairs).

| Metric | Definition | Result |
|---|---|---|
| Conflict detection (primary) | Fresh info surfaced in answer | 83.3% |
| Full stale suppression (strict) | Answer contains only fresh keywords | 33.3% |
| Incorrect stale | Answer reflects only the outdated claim | 3.3% |

8.8 Semantic Conflict Resolution (E14)

When the same entity accumulates contradictory facts over time (e.g., a company changes its primary database), the system must surface the most recent claim.
E14 tests this with 30 conflict pairs across 15 categories (database, cloud provider, team size, budget, etc.), each consisting of a stale memorization (74–270 days old) followed by a fresh memorization (0–57 days old) for the same contact. Retrieval applies exponential recency decay (half-life = 38 days) to rank recent facts above outdated ones. The primary correctness measure is conflict detection: whether the fresh claim is present in the answer at all. Full stale suppression applies a stricter standard: the answer must contain only the fresh claim with zero reference to stale content. Of the 15 both_present verdicts, all 15 answers lead with the fresh claim and reference the stale value only as transition context (e.g., “migrated from AWS to Google Cloud”). Effective answer correctness (fresh claim present and presented as current) is 83.3%. The single incorrect-stale case involved a pain-point category where the fresh and stale claims shared overlapping vocabulary, causing keyword-based evaluation ambiguity.

Table 6. End-to-end evaluation (10 prospects, 3 runs each, sales rubric /100).

| Condition | Avg Score /100 | ∆ vs. Baseline |
|---|---|---|
| A: No memory | 79.5 | (baseline) |
| B: Raw memory | 85.2 | +5.7 |
| C: Open-set + governance | 86.4 | +6.9 |
| D: Full governed memory | 85.9 | +6.4 |

Table 7. LoCoMo benchmark results.

| Category | Accuracy | vs. Human Baseline |
|---|---|---|
| Single-hop | 78.7% | −16.4p |
| Multi-hop | 51.7% | −34.1p |
| Temporal | 64.6% | −28.0p |
| Open-ended | 83.6% | +8.2p |
| Overall | 74.8% | −13.1p |

8.9 End-to-End Ablation (E8)

Memory provides the primary quality gain (A→B: +5.7 pts). Governance routing adds measurable refinement (B→C: +1.2 pts). Full governed memory (D, +6.4 pts) scores comparably to C on this rubric (−0.5 pts), an expected result given that email-generation quality metrics measure tone, framing, and personalization, dimensions where open-set facts and governance context already provide the dominant signal.
Schema enforcement’s primary value is realized downstream of generation: typed, validated property values enable reliable CRM synchronization, analytics aggregation, and structured API consumption that are orthogonal to single-interaction quality scores. Organizations can iteratively improve schema extraction accuracy using the AI-assisted schema optimization tools (Section 7), refining property definitions based on observed extraction patterns and measuring downstream impact through the same self-evaluation pipeline. The combined system (D, +6.4 pts) consistently outperforms the no-memory baseline across all runs, validating the core thesis. The rubric used here likely underestimates the true quality differential: a rubric optimized for single-interaction email quality cannot capture the compounding gains that schema enforcement and governed memory produce at scale, across repeated interactions, multi-record aggregation, conditional personalization logic, and downstream system reliability. A rubric designed to measure those dimensions would be expected to show a larger separation between conditions.

8.10 Adversarial Governance (E15)

Against 50 adversarial scenarios designed to bypass governance constraints, distributed across easy, medium, and hard difficulty, the system achieves 100% compliance across all difficulty levels and a 96% guardrail activation rate (48/50 scenarios triggered the intended guardrail; 2 easy-category inputs resolved correctly without explicit guardrail invocation). Zero organizational policy leakage was observed. This validates the governance layer as robust to deliberate constraint circumvention attempts.

8.11 External Benchmark Validation: LoCoMo

As external validation that the system functions as a general-purpose long-term memory system, we evaluated on LoCoMo [21] (272 sessions, 1,542 questions across 10 conversations).
The system achieves 74.8% overall against a human baseline of 87.9%, exceeding human-level performance on open-ended inference (83.6% vs. 75.4%), the largest category (841 questions). This result confirms that the governance, schema enforcement, and entity isolation layers impose no retrieval quality penalty: the architecture achieves state-of-the-art memory accuracy despite doing substantially more than standalone memory systems. For context, independently evaluated memory systems report 42–67% on comparable settings: Mem0 [16] at 64–67%, Zep at 42–66%, OpenAI built-in memory at ~53%. Multi-hop (51.7%) and temporal (64.6%) remain active optimization areas. Results use a hybrid text-match-first / LLM-judge-fallback methodology; 950 of 1,153 correct answers (82.4%) were scored via text match. Published systems use varying methodologies (pure token-overlap F1 or pure LLM-as-judge), producing scores that are not directly comparable across systems.

9 Discussion

9.1 Limitations

Quality gates are heuristic. Coreference, self-containment, and temporal anchoring scores use pattern-based heuristics rather than deep semantic analysis. Calibration against human judgments would strengthen confidence.

Self-evaluation relies on self-judgment, with mitigations (rubric-first prompting, cross-model evaluation) that reduce but do not eliminate LLM-as-judge biases [12].

Deduplication thresholds are empirically tuned (0.92 write-side, 0.95 consolidation); adaptive thresholds based on content characteristics would be more robust.

Redaction is regex-based [20], covering well-structured PII with high precision but potentially missing obfuscated or context-dependent patterns.

Multi-agent write conflicts are unvalidated. E14 evaluates temporal conflicts between sequential writes (stale vs. fresh claims over time) but does not test concurrent writes from multiple agents acting on the same entity simultaneously.
Because the system is a shared layer serving many agents, concurrent write conflicts are a realistic production scenario; conflict detection and resolution under concurrent conditions remain an open problem.

9.2 Design Tensions

Open-set vs. schema-enforced memory is a spectrum: some facts extracted as open-set could map to schema properties. The dual model accepts this overlap deliberately, ensuring no information is lost.

Reflection yields strong gains (+25.7p) under optimal query generation but shows diminishing returns when underlying data is absent; the gap between manual multi-hop (62.8%) and API-managed reflection (40.4%) indicates that query generation strategy, rather than round count, is the primary lever (Section 8). Making the round bound query-adaptive and improving API query generation are planned future work.

Progressive delivery assumes previously delivered context remains relevant; the 24-hour session TTL limits but does not eliminate staleness.

9.3 Future Work

Key remaining gaps include: cross-organization validation measuring whether patterns hold across varying content and schema maturity, including multi-agent write compatibility and contradiction detection; automated schema expansion proposing new properties from unextracted content patterns; hybrid retrieval combining semantic search with keyword lanes and reciprocal rank fusion; and ML-augmented redaction supplementing regex patterns with transformer-based NER for non-standard PII.

9.4 Ethical Considerations

The system stores extracted facts about people, and memory accumulation creates detailed profiles. Organizations bear responsibility for regulatory compliance (GDPR, CCPA) and data subject rights. Schema-enforced extraction reflects biases in both language models and operator-defined schemas; organizations should review schemas for discriminatory patterns.
The governance layer provides a mitigation: organizational policies are explicit, auditable artifacts rather than implicit model behaviors.

10 Conclusion

This paper identified the memory governance gap (five structural challenges that arise when dozens of autonomous agent nodes act on the same entities across workflows without shared memory or common governance) and presented Governed Memory, a shared memory and governance layer addressing it through four integrated mechanisms. Experimental validation confirms that each mechanism works at production scale: 99.6% fact recall with complementary dual-modality coverage; 92% governance routing precision; zero cross-entity leakage across 3,800 results from 500 adversarial queries; 100% adversarial governance compliance; and 74.8% on the LoCoMo benchmark, confirming that governance, schema enforcement, and entity isolation impose no retrieval quality penalty. Output quality saturates at approximately seven governed memories per entity, establishing a practical operating point for agentic deployments.

The terminology introduced (governed memory, governance routing, progressive context delivery, memory quality gates, schema lifecycle management) provides reference points for the growing community building production memory systems for agentic workflows. As autonomous agent deployments scale across organizations, accuracy and compliance across distributed agent nodes become architectural requirements, not afterthoughts. The system described is commercially deployed; code, datasets, and extended supplementary material are available at https://github.com/personizeai/governed-memory.

References

[1] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-T. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In NeurIPS, 2020.
[2] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language Agents with Verbal Reinforcement Learning. In NeurIPS, 2023.
[3] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In ICLR, 2024.
[4] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the Middle: How Language Models Use Long Contexts. TACL, 12:157–173, 2024.
[5] S. Borgeaud et al. Improving Language Models by Retrieving from Trillions of Tokens. In ICML, 2022.
[6] Y. Gao et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997, 2023.
[7] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative Agents: Interactive Simulacra of Human Behavior. In UIST, 2023.
[8] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR, 2023.
[9] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560, 2023.
[10] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang. MemoryBank: Enhancing Large Language Models with Long-Term Memory. In AAAI, 2024.
[11] S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling. Corrective Retrieval Augmented Generation. arXiv:2401.15884, 2024.
[12] L. Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In NeurIPS, 2023.
[13] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert. RAGAS: Automated Evaluation of Retrieval Augmented Generation. In EACL, 2024.
[14] S. Kadavath et al. Language Models (Mostly) Know What They Know. arXiv:2207.05221, 2022.
[15] J. Liu, Y. Su, P. Xia, S. Han, Z. Zheng, C. Xie, M. Ding, and H. Yao. SimpleMem: Efficient Lifelong Memory for LLM Agents. arXiv:2601.02553, 2026.
[16] P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413, 2025.
[17] S. Wang, Y. Zhou, and Y. Fang. BookRAG: A Hierarchical Structure-aware Index-based Approach for Retrieval-Augmented Generation. arXiv:2512.03413, 2025.
[18] E. Lumer et al. Rethinking Retrieval: From Traditional RAG to Agentic and Non-Vector Reasoning Systems in the Financial Domain. arXiv:2511.18177, 2025.
[19] Z. Li, H. Tian, L. Luo, Y. Cao, and P. Luo. DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search. arXiv:2602.05014, 2026.
[20] Microsoft. Presidio: Data Protection and De-identification SDK. GitHub, 2024.
[21] A. Maharana, D. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fung. Evaluating Very Long-Term Conversational Memory of LLM Agents. In ACL, 2024.
[22] D. Jha, M. Gupta, D. Ganguly, and V. Varma. ENGRAM: Efficient Graph-Enhanced Memory Networks for Long-Term Conversational Agents. arXiv:2511.12960, 2025.
[23] R. Nogueira, W. Yang, J. Lin, and K. Cho. Document Expansion by Query Prediction. arXiv:1904.08375, 2019.
[24] L. Gao, X. Ma, J. Lin, and J. Callan. Precise Zero-Shot Dense Retrieval without Relevance Labels. In ACL, 2023.

A Data Model and Entity Scoping

All memory entries share a unified record structure:

    MemoryEntry:
      id               : string    -- unique identifier
      text             : string    -- atomic fact or normalized property statement
      orgId            : string    -- organizational partition key
      recordId         : string?   -- entity scope (contact, company, deal)
      type             : string    -- "memory" | "property_value"
      keywords         : string[]  -- extracted keywords
      persons          : string[]  -- mentioned persons
      entities         : string[]  -- mentioned entities
      location         : string?   -- geographic reference
      topic            : string?   -- topic classification
      timestamp        : string?   -- temporal anchor (ISO 8601)
      customAttributes : map       -- schema-enforced typed values
      source           : string    -- provenance marker
      score            : float?    -- relevance score (populated during retrieval)
      createdAt        : string
      updatedAt        : string

      -- Property-specific fields (when type = "property_value")
      propertyId       : string?   -- schema property identifier
      propertyName     : string?   -- human-readable property name
      systemName       : string?   -- canonical system name
      propertyValue    : string?   -- extracted value (serialized)
      collectionId     : string?   -- schema collection scope
      confidence       : float?    -- extraction confidence (0.0--1.0)

      -- Provenance metadata (stored in customAttributes)
      provenance       : map?
        contentHash      : string    -- SHA-256 of first 1000 chars
        contentLength    : int       -- original source content length
        speaker          : string?   -- identified speaker
        extractionMethod : string    -- "single_extract" | "dual_extract"
        llmModel         : string?   -- model identifier
        chunkIndex       : int?      -- chunk position (0-based)
        chunkTotal       : int?      -- total chunks in source
        redactionApplied : boolean?  -- whether redaction was applied
        timestamp        : string    -- ISO 8601 extraction timestamp

The contentHash enables tracing entries back to source documents. When content is chunked, chunkIndex and chunkTotal record position for reconstruction.

Entity scoping. All operations are partitioned by orgId. Within an organization, retrieval is scoped to entities using CRM keys:

    CRMKeys:
      recordId          : string?
      email             : string?
      websiteUrl        : string?
      phoneNumber       : string?
      customIdentifiers : map?

Entity types are open-ended: contacts, companies, deals, vendors, partners, devices, locations, and content assets share the same mechanisms.

B Security, Privacy, and Content Redaction

Security posture. Data is encrypted at rest and in transit. Organization-level partition keys provide hard tenant isolation at the storage layer. Governance variable visibility (organization, private, adminsOnly) and access levels (readOnly, cloneable, editable) are enforced at the API layer.
All operations are logged with timestamps, user identifiers, and operation metadata.

Two-phase content redaction. A redaction pipeline scrubs PII and secrets before and after LLM extraction:

Phase 1 (Pre-Extraction). Raw text is scanned for sensitive patterns; matches are replaced with typed placeholders, ensuring the LLM never sees original values.

Phase 2 (Post-Extraction). Extracted values are scanned again, catching cases where the LLM reconstructs PII-like patterns from contextual cues.

Entity detection is organized into four tiers:

| Tier | Category | Entity Types | Detection |
|---|---|---|---|
| 1 | Secrets | API keys, private keys, passwords | Pattern matching |
| 2 | Financial PII | Credit cards, IBAN | Regex + Luhn |
| 3 | Identity PII | Social Security Numbers | Regex + validation |
| 4 | Contact PII | Emails, phones, IPs | Format matching |

Three anonymization strategies are supported: redact (typed placeholder), mask (preserve last 4 digits), hash (SHA-256 prefix for linkability without reversibility).

Algorithm 1: Content Redaction
Input: text (string), config (RedactionConfig)
Output: redactedText (string), audits (RedactionAudit[])

    audits <- []
    for each entityPattern in ENTITY_PATTERNS:
        if entityPattern.tier not enabled in config: continue
        if entityPattern is EMAIL and config.skipEmails: continue
        if entityPattern is PHONE and config.skipPhones: continue
        count <- 0    -- replacements made for this pattern
        matches <- findAll(text, entityPattern.regex)
        for each match in matches:
            -- skip candidates that fail the pattern's validator, when one is defined
            if entityPattern.validate and not entityPattern.validate(match): continue
            text <- replace(text, match, applyStrategy(match, config.strategy))
            count <- count + 1
        if count > 0: audits.append(entityPattern.tier, entityPattern.entityType, count)
    return text, audits

C Algorithm Descriptions

This appendix summarizes the core algorithms referenced throughout the paper. Detailed algorithmic specifications, experiment protocols, and synthetic datasets are publicly available at https://github.com/personizeai/governed-memory.

Algorithm 2: Dual Extraction Pipeline.
The pipeline proceeds in nine stages: (1) optional pre-extraction PII redaction; (2) content chunking with content-type-specific overlap parameters (dialogue, transcript, and document modes); (3) per-chunk embedding-based property selection from the organizational schema; (4) per-chunk dual LLM extraction producing both open-set atomic facts and typed property values in a single call; (5) post-extraction redaction scan; (6) cross-chunk deduplication prioritizing highest-confidence properties and normalized-text deduplication for facts; (7) quality gate computation (coreference, self-containment, temporal anchoring); (8) batch embedding generation with provenance attachment; and (9) write-side deduplication against the existing store using cosine similarity thresholds.

Algorithm 3: Background Consolidation. A two-phase process: (1) merge near-duplicate memories using a deliberately higher similarity threshold than write-side deduplication to minimize false merges; (2) prune stale memories beyond a configurable retention window, with sole-memory protection ensuring no entity is left with zero open-set memories.

Algorithm 4: Embedding Pre-Filter. Reduces the governance variable candidate set before LLM-based routing. Candidates without embeddings pass through unconditionally. Remaining candidates are scored by cosine similarity against the task embedding; those exceeding a minimum score threshold or ranking in the top-K are retained.

Algorithm 5: Tiered Governance Routing. Two paths depending on resolved mode.
Fast path: the task message is embedded and scored against all candidates using a composite of embedding similarity and keyword overlap against variable metadata and synthetic queries; always-on variables receive unconditional inclusion; results are partitioned into critical and supplementary sets; session state is consulted to exclude already-delivered variables before returning compiled context.

Full path: an embedding pre-filter reduces the candidate set, then an LLM performs multi-step structured analysis classifying each variable as critical or supplementary with section-level precision; fallback promotion applies if no critical selections result. Both paths return a compiled critical context block and supplementary metadata.

Algorithm 6: Reflection-Bounded Retrieval. An iterative loop bounded by a configurable maximum round count. Each round: (1) an LLM judges evidence completeness against the original query at low temperature; (2) if incomplete, the LLM generates targeted follow-up queries at moderate temperature; (3) follow-up queries are embedded and used for additional vector searches; (4) results are merged by identifier across rounds. The loop terminates when the completeness check passes or no further follow-up queries are generated.

Algorithm 7: Automated Schema Refinement Pipeline. A three-phase process: (1) extraction replay producing baseline results; (2) per-property analysis classifying each property as extracted, missed, low-confidence, inaccurate, or unavailable, with structured improvement instructions; (3) parallel per-property optimization producing revised definitions with change annotations. The three-phase design separates objective data, diagnostic judgment, and targeted fixes.

D Background Consolidation Details

Merge threshold selection. The merge threshold (0.95 cosine similarity) is set higher than write-side deduplication (0.92).
At write time, the system errs toward preventing duplicates; during consolidation, the system errs toward preserving distinct memories, since false merge cost exceeds near-duplicate retention cost.

Sole-memory protection. The prune phase ensures no entity is left with zero open-set memories. If all memories for an entity are older than the retention cutoff, the most recent is preserved.

Operational design. Organizations below a minimum memory count (default: 10) are skipped. A dry-run mode logs merge/prune decisions without executing deletions. An optional compaction step triggers storage compaction one hour after consolidation.

E Governance Routing Details

Stage 2: LLM Multi-Step Structured Selection

The LLM performs a four-step structured analysis: (1) Task Understanding, restating the objective and implicit requirements; (2) Quality Dimension Identification, determining which dimensions matter (tone, compliance, structure, etc.); (3) Task Refinement, rewriting the task as a precise instruction; (4) Selection and Prioritization, for each variable, specifying priority (critical/supplementary), mode (full/section), and reasoning.

Fallback promotion. If no critical selections are returned but supplementary exist, the top two supplementary items are promoted.

Section-Level Extraction

When mode “section” is selected, only requested sections are extracted using heading hierarchy boundaries. If a heading is not found, full content is delivered as fallback.

AI-Assisted Governance Authoring

Generation from intent. The system generates governance content from variable name and description alone, producing actionable, structured content (criteria with bullet points, guidelines with sections, templates with placeholders, procedures with steps).

Iterative refinement.
Users refine content through natural-language feedback, preserving format while incorporating changes (adding content, revising tone, tightening criteria). This pattern applies symmetrically to property schema definitions.

F Extended Experimental Results

Memory Density and Output Quality (E2)

| Density | Avg Recalled | Score /100 | Mem Use /30 |
|---|---|---|---|
| Sparse (0 memories) | 0 | 69.3 | 19.0 |
| Minimal (3) | 3 | 86.0 | 28.0 |
| Light (7) | 7 | 88.0 | 28.5 |
| Moderate (12) | 12 | 84.4 | 26.0 |
| Rich (20) | 20 | 85.2 | 27.0 |
| Full (30) | 30 | 88.3 | 29.5 |

Zero entity memory produces measurably lower output quality (69.3/100). The first three memories provide a +24% relative quality jump (69.3 → 86.0). Quality plateaus around light density (7 memories, 88.0), with diminishing returns beyond: approximately 7 high-signal governed memories are sufficient to reach near-peak personalization quality in this evaluation setting.

Progressive Delivery Savings (E4)

| Step | Task | Without (tokens) | With (tokens) | Savings |
|---|---|---|---|---|
| 1 | Cold outreach | 9,312 | 9,312 | 0% |
| 2 | Follow-up w/ pricing | 6,802 | 959 | 85.9% |
| 3 | Support escalation | 2,873 | 2,873 | 0% |
| 4 | Troubleshooting | 2,698 | 1,358 | 49.7% |
| 5 | Closing proposal | 9,482 | 982 | 89.6% |
| Total | | 31,167 | 15,484 | 50.3% |

Savings are topic-dependent: re-entrant steps (same governance domain already loaded) achieve 50–90% savings; steps entering new domains require fresh context and save 0%. The 50.3% overall savings reflects a realistic mixed-domain workflow.

Write-Side Deduplication (E6)

Across five overlapping sources for a single entity, the system stored 33 unique memories while skipping 162 duplicates (83.1% dedup rate) with zero false positives. Near-miss facts (semantically similar but factually distinct) were correctly preserved.

G Schema Worked Example and Diagnostic Patterns

Worked Example: Per-Property Refinement

Before refinement:

    Property: "Technology Stack"
    type: text
    description: "The company’s technology"

Phase 2 classifies this as low_confidence: the description is too vague, producing inconsistent extraction.
After refinement:

    Property: "Technology Stack"
    type: text
    description: "The primary technology infrastructure used by the company, including programming languages (e.g., Python, Java), frameworks (e.g., React, Django), cloud platforms (e.g., AWS, Azure), and databases (e.g., PostgreSQL, MongoDB). Focus on technical stack decisions rather than product or SaaS tool usage."

Diagnostic Patterns

Table 8. Recurring diagnostic patterns from evaluation records.

| Score Pattern | Interpretation | Indicated Action |
|---|---|---|
| Low Context Util., high Completeness | Agent succeeded despite routing issues | Improve governance metadata |
| High Context Util., low Completeness | Appropriate context but insufficient memory | Improve memory coverage |
| Low Personalization, high Accuracy | Entity memories sparse or not recalled | Check density; review recall |
| Low across all criteria | Model or prompt issue | Review model & system prompt |
| High variance within criterion | Schema-data alignment issue | Refine low-scoring types |

Evaluation Rubric Presets

| Preset | Criteria (weight) |
|---|---|
| Default | Accuracy (25), Relevance (25), Completeness (25), Context Util. (25) |
| Sales | Personalization (30), Value Prop. (25), CTA (20), Tone (25) |
| Support | Problem Understanding (25), Solution Accuracy (30), Clarity (25), Empathy (20) |
| Research | Thoroughness (30), Source Quality (25), Analysis (25), Organization (20) |
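As a minimal sketch of how weighted presets like these might combine into a single score: the criteria and weights are copied from the presets above, but the source does not state the aggregation rule, so the weighted average, the 0-100 per-criterion scale, and the names `RUBRIC_PRESETS` and `weighted_score` are assumptions for illustration only.

```python
# Hypothetical sketch: the source lists criteria and weights but not the
# aggregation rule; a weighted average on a 0-100 scale is assumed here.
RUBRIC_PRESETS = {
    "Default": {"Accuracy": 25, "Relevance": 25,
                "Completeness": 25, "Context Util.": 25},
    "Sales": {"Personalization": 30, "Value Prop.": 25,
              "CTA": 20, "Tone": 25},
    "Support": {"Problem Understanding": 25, "Solution Accuracy": 30,
                "Clarity": 25, "Empathy": 20},
    "Research": {"Thoroughness": 30, "Source Quality": 25,
                 "Analysis": 25, "Organization": 20},
}


def weighted_score(preset, scores):
    """Combine per-criterion scores (0-100) using the preset's weights."""
    weights = RUBRIC_PRESETS[preset]
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up per-criterion scores, not results from the paper.
    example = {"Personalization": 60, "Value Prop.": 80, "CTA": 90, "Tone": 80}
    print(weighted_score("Sales", example))
```

Dividing by the sum of the weights rather than a hard-coded 100 keeps the function correct for any custom preset whose weights do not sum to 100.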