
Paper deep dive

A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations

Joshua Castillo, Ravi Mukkamala

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 40

Abstract

The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLM models and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning, using curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally; view it on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/13/2026, 12:56:59 AM

Summary

Guardian is an end-to-end, consensus-driven multi-LLM pipeline designed to assist in missing-person investigations by performing structured information extraction, summarization, and weak labeling. The system utilizes a multi-model architecture where task-specialized LLMs generate candidates that are reconciled by a centralized consensus engine to ensure reliability, schema conformity, and auditability, particularly during the critical first 72 hours of a search.

Entities (6)

Guardian · system · 100%
Guardian LLM Pipeline · software-architecture · 98%
Consensus Engine · component · 95%
QLoRA · fine-tuning-technique · 95%
Llama-3.2-3B-Instruct · llm-model · 90%
Qwen2.5-3B-Instruct · llm-model · 90%

Relation Signals (4)

Guardian utilizes Guardian LLM Pipeline

confidence 100% · This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction

Guardian LLM Pipeline employs Consensus Engine

confidence 95% · The consensus engine (Figure 3) is the primary reliability mechanism in Guardian Core.

Guardian LLM Pipeline integrates QLoRA

confidence 95% · Guardian integrates QLoRA-based fine-tuning to improve role-specific performance

Consensus Engine resolves disagreements between Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct

confidence 90% · It resolves conflicts between predictions generated by Qwen and Llama

Cypher Suggestions (2)

Identify models used by the pipeline · confidence 95% · unvalidated

MATCH (p:Pipeline)-[:EMPLOYS]->(m:Model) RETURN m.name

Find all components of the Guardian system · confidence 90% · unvalidated

MATCH (s:System {name: 'Guardian'})-[:HAS_COMPONENT|UTILIZES]->(c) RETURN c

Full Text

39,203 characters extracted from source content.


A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations

Joshua Castillo and Ravi Mukkamala [0000-0001-6323-9789]
Old Dominion University, Norfolk VA 23529, USA
jcast046,rmukkama@odu.edu

Abstract. The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLM models and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning, using curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.

Keywords: Consensus-based decision making · Intelligent decision support systems · Large Language Models · Multi-model integration.

Research Track 001: APPLIED ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

1 Introduction

Missing-child search planning is a complex, multidisciplinary process that requires the coordinated integration of information, expertise, and resources from multiple stakeholders. Effective search operations typically synthesize heterogeneous inputs, including last-known-position estimates, environmental and terrain data, weather conditions, sensor observations, and behavioral or mobility models of the missing subject [14]. These inputs are contributed by search-and-rescue (SAR) coordinators, field teams, subject-matter experts, data analysts, and increasingly by computational decision-support systems.
The effectiveness of a missing-search operation therefore depends not only on the accuracy of individual data sources, but on the structured fusion of multi-party inputs to support timely, informed decision-making under uncertainty [23].

arXiv:2603.08954v1 [cs.AI] 9 Mar 2026

Missing-child investigations begin with incomplete, rapidly evolving information and severe time constraints, particularly during the first 72 hours. Traditional early-stage search planning relies heavily on human judgment, coarse heuristics, and manual fusion of heterogeneous sources such as narrative reports, PDFs, public tips, transit data, and maps [11]. In practice, the core challenge is not simply to predict a single location, but to produce calibrated uncertainty and actionable search products—such as prioritized regions and time-dependent likelihood surfaces—under extreme data sparsity [23,24].

Fig. 1. Guardian System Architecture with two distinct but interconnected systems

Guardian addresses these challenges through an end-to-end decision-support pipeline that converts raw, unstructured case documents into probabilistic search surfaces over a geographic grid and a set of human-interpretable artifacts, including ranked sectors, hotspots, and containment rings for 24-, 48-, and 72-hour horizons. Guardian is organized as a two-stage system (Figure 1). In Stage 1, the data preprocessing phase (Guardian Parser Pack), the system ingests heterogeneous raw inputs, normalizes and validates extracted fields, and enriches cases with external contextual data. In Stage 2, the analysis and evaluation phase (Guardian Core), the system performs structured validation, case generation, LLM-based processing with consensus, clustering and hotspot formation, probabilistic forecasting of ring-and-likelihood zones, search-plan generation, and plan evaluation.
All outputs are designed to be auditable and consumable by investigators without requiring exposure to internal model mechanics.

In this paper, we focus on the Guardian Pipeline, an LLM pipeline for information extraction and processing within the Guardian Core system. Given the time-sensitive nature of the application, rather than treating any single model as authoritative, Guardian treats each model as a fallible expert and routes all generated (predicted) information through a centralized, multi-model consensus layer. Thus, reliability is framed as a systems property. In Guardian, reliability is not treated as a single scalar score but as an operational property of the pipeline. More specifically, reliability refers to the degree to which the system produces valid, correct, and consistent outputs under model disagreement, malformed generations, and partial failures. In this paper, that notion includes structural correctness (for example, parseable and schema-aligned outputs), factual correctness relative to available ground truth, and cross-model consistency after normalization, repair, and consensus. The pipeline produces concise investigator summaries, schema-aligned extractions, and weak (noisy or probabilistic) labels that remain traceable to the underlying narrative and are suitable for downstream integration with hotspot detection, mobility forecasting, and geospatial planning components, which themselves demand spatially meaningful evaluation beyond generic accuracy metrics [1,15,17].

The remainder of the paper is organized as follows. Section 2 summarizes related work. Section 3 details the components of the proposed LLM consensus architecture. Section 4 describes the LLM prompting system and its governance. Section 5 briefly summarizes the role of QLoRA in fine-tuning the models.
Section 6 provides a qualitative evaluation of the system. Section 7 discusses the overall system along with its limitations. Finally, Section 8 provides a summary and plans for future work.

2 Related Work

Guardian draws methodological motivation from four intersecting research areas: missing-person decision support systems, unstructured document understanding, weak (noisy, incomplete, or probabilistic) supervision and scalable labeling, and mobility-oriented geospatial modeling.

Work in missing-person analytics and search optimization highlights the importance of fusing diverse data streams and generating actionable prioritizations rather than raw predictions [23,24]. Complementary literature in search and rescue demonstrates that probabilistic mapping and agent-based reasoning can convert sparse observations into interpretable density estimates [10,13], while GIS-based mobility algorithms formalize how movement patterns can be leveraged for planning [21].

For unstructured document analysis, surveys emphasize robust pipelines that normalize messy multi-format inputs into structured, analyzable representations, aligning with Guardian's emphasis on schema-first processing and validation [18]. In narrative-to-structure transformation, earlier work on entity extraction from police reports demonstrates the long-standing value of converting noisy narrative text into discrete entities and relations as a foundation for downstream analysis [5]. Our Guardian system extends this direction by using LLMs for structured extraction while constraining outputs through schema validation and consensus-based reliability controls.

Research on weak supervision and label generation, where scalable training data is created by combining multiple noisy signals rather than relying solely on manual annotation [22], is the primary motivation for us to employ it in the Guardian system.
Recent work demonstrates that language models themselves can supply supervision signals for text classification, enabling learning under limited labeled data [28]. Within Guardian, LLMs are positioned as controlled labelers whose outputs are audited, cross-compared, and merged through consensus, consistent with findings that LLM-based annotation is particularly valuable when applied conservatively and validated [6]. Finally, Guardian's broader geospatial setting is informed by mobility forecasting literature and the recognition that spatial reasoning must be evaluated with spatially meaningful metrics, not solely generic accuracy measures [15,17].

Because missing-person workflows implicate sensitive data, Guardian's design also aligns with ethical guidance advocating transparency, accountability, and risk-aware deployment of AI, especially where privacy and harm are salient [4,12,14]. Where synthetic data is employed for development and training, prior work highlights both its utility and the need to manage domain shift and privacy considerations [20,25].

3 System Architecture

Guardian's LLM pipeline (Figure 2) begins with an entry module (Case Narrative) that is the primary entry point for end-to-end execution across multiple cases. This module is responsible for loading cases, constructing narratives, initializing concurrency controls, and calling the pipeline in either a stage-by-stage mode or a case-by-case mode.

Stage-by-stage processing is a throughput-oriented strategy in which the system runs all cases through the summarization stage, then all cases through extraction, and then all cases through weak labeling. This reduces repeated initialization overhead, improves cache locality, and enables more consistent resource utilization, a design objective aligned with classic algorithmic and systems concerns about batching and efficient scheduling [7].
Case-by-case processing, by contrast, is retained for debugging and interactive analysis where immediate end-to-end visibility is more valuable than throughput.

3.1 Centralized Consensus Layer

The consensus engine (Figure 3) is the primary reliability mechanism in Guardian Core. Its purpose is not merely to “choose a better answer,” but to enforce invariants: schema conformity, factual supportability relative to candidates, deterministic structure, and controlled behavior under disagreement.

Fig. 2. Guardian LLM Pipeline Overview and Consensus Routing
Fig. 3. Centralized Consensus Mechanism and Conflict Resolution Workflow

The consensus process begins with normalization, which transforms each candidate into a comparable representation. For summaries, normalization enforces a fixed bullet structure and strips formatting idiosyncrasies so that agreement can be measured meaningfully. For extraction, normalization parses JSON using multiple strategies and coerces fields into schema-defined types, inserting empty defaults for missing required fields. For weak labels, normalization maps free-form terms to a closed label set, constraining movement and risk to valid categories and standardizing rationales.

Following normalization, the engine performs agreement scoring. Agreement is computed at the field level rather than solely at the surface-text level, because textual paraphrases can mask substantive discrepancies. For summaries, agreement is measured by comparing extracted informational slots such as subject identity, last-seen location and time, vehicle references, and movement cues, using token overlap and constrained similarity thresholds. For extraction, agreement compares key schema fields and treats lists as order-invariant sets where appropriate.
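The field-level agreement scoring described above can be sketched as follows. This is a minimal illustration, not Guardian's actual implementation: the token-overlap (Jaccard) measure, the 0.5 threshold, and the example field names are assumptions for the sketch.

```python
def _tokens(value):
    """Lowercase token set for a scalar field value."""
    return set(str(value).lower().split())

def field_agreement(a, b):
    """Token-overlap (Jaccard) agreement for one field; lists are
    compared as order-invariant sets, per the consensus design."""
    if isinstance(a, list) and isinstance(b, list):
        sa, sb = set(map(str, a)), set(map(str, b))
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def candidate_agreement(cand_a, cand_b, keys, threshold=0.5):
    """Fraction of schema fields on which two candidates agree
    at or above the per-field similarity threshold."""
    agreeing = sum(
        1 for k in keys
        if field_agreement(cand_a.get(k), cand_b.get(k)) >= threshold
    )
    return agreeing / len(keys)
```

In a consensus-first design, a score like this would gate whether two candidates (say, a Qwen and a Llama extraction) are accepted as agreeing or routed onward to referee adjudication.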
This strategy is motivated by the observation that stable downstream analytics depend on stable structured fields, not on stylistic similarity.

The consensus layer also supports targeted repair when parsing fails. If extraction candidates contain malformed JSON, the system attempts recovery through deterministic extraction of JSON-like substrings and, only if necessary, invokes a repair prompt that instructs a model to return valid JSON matching the schema. Even in repair mode, the system validates repaired outputs and constrains them to the schema's required fields, reflecting the broader practice in unstructured document pipelines of combining statistical methods with rule-based normalization and strict validation [18].

Finally, the engine emits trace artifacts that record whether the referee was called, which agreement thresholds were met, and which fields were repaired or reverted. These traces are crucial for auditability and for iterative improvement of prompts, validators, and downstream evaluation metrics, consistent with calls for transparent and responsible AI deployment [12,14].

3.2 Backend Abstraction

The backend module isolates the pipeline from the heterogeneity of model providers and deployment modes. The backend layer implements three main responsibilities: transport, resiliency, and policy.

Transport standardizes how prompts are packaged and sent, including message formats for chat-style endpoints and parameter normalization. Resiliency encompasses retry policies with exponential backoff and jitter for transient failures, as well as strict timeouts to prevent indefinite blocking, a critical requirement in time-sensitive investigative workflows. Policy covers rate limiting and caching decisions that protect both cloud quotas and local GPU resources.
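The resiliency behavior described above (exponential backoff with jitter, plus a hard per-request timeout) can be sketched in a few lines. This is a generic pattern, not Guardian's code; the exception name and default parameters are illustrative.

```python
import random
import time

class TransientBackendError(Exception):
    """Rate limit, connection reset, or other retryable failure."""

def call_with_resilience(send, *, retries=4, base_delay=0.5, timeout=30.0):
    """Retry a backend call with exponential backoff and full jitter,
    re-raising after the final attempt. `send(timeout=...)` is any
    transport function that performs one request with a hard timeout."""
    for attempt in range(retries):
        try:
            return send(timeout=timeout)
        except TransientBackendError:
            if attempt == retries - 1:
                raise
            # Full jitter: sleep a uniform amount in [0, base_delay * 2**attempt]
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

The hard timeout matters as much as the retries: a blocked call with no deadline would stall an entire case in a time-critical workflow.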
3.3 Orchestration and Concurrency

The orchestrator module coordinates parallel execution across models and tasks, managing deadlines, concurrency limits, caching interactions, and the handoff into consensus. Orchestration is central in a consensus-first design because each case triggers multiple model calls per task; without careful scheduling, the system can overload compute resources, exceed rate limits, or create cascading delays that undermine operational utility. Guardian therefore treats orchestration as both a performance layer and a reliability layer.

In addition to concurrency, orchestration implements deadline management at the case and stage level. Each task receives a deadline budget, and the orchestrator continually checks remaining time before escalating to more expensive operations such as referee adjudication. If time is insufficient, the system returns the best validated candidate available rather than attempting a late referee call that could time out and yield no result. This time-aware behavior supports the investigative reality that timely, conservative outputs are often more valuable than delayed, potentially more refined outputs.

Orchestration also integrates caching to reduce redundant computation. When a case has been processed previously, the orchestrator can return cached results for a given stage, provided the caller has not requested a forced recomputation.

Finally, the orchestrator defines the precise ordering in which candidate generation, normalization, consensus adjudication, validation, and persistence occur. This ordering matters because downstream analytics (clustering, mobility forecasting, and search-zone generation) depend on the stability and completeness of extracted fields [9,15].
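The time-aware fallback described above can be sketched as a single decision point. This is a simplified illustration under assumed names: the candidate dicts, the `score` field, and the fixed referee cost estimate are not from the paper.

```python
import time

def adjudicate_within_deadline(candidates, referee, deadline, referee_cost_s=5.0):
    """Escalate to referee adjudication only when the remaining deadline
    budget can absorb it; otherwise return the best already-validated
    candidate rather than risk a late referee call that yields nothing.
    `deadline` is an absolute time.monotonic() value."""
    validated = [c for c in candidates if c.get("valid")]
    remaining = deadline - time.monotonic()
    if validated and remaining < referee_cost_s:
        # Conservative fallback: best validated candidate wins on time-out risk.
        return max(validated, key=lambda c: c.get("score", 0.0))
    return referee(validated or candidates)
```

A real orchestrator would also propagate the shrunken budget into the referee call itself, so that adjudication inherits the same hard deadline.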
3.4 Zone QA

The Zone QA (Quality Assurance) module extends the Guardian pipeline into the search-zone domain (a search zone is a specific, defined, and delineated geographic area) by applying LLM-assisted plausibility scoring to candidate zones and reweighting zone priorities under an explicit reward configuration. Operationally, this module loads cases that include zone candidates and their associated metadata, optionally loads auxiliary zone scores (such as normalized RL scores), and processes cases in batches to compute plausibility assessments.

The Zone QA output is not a free-form narrative judgment; it is a constrained score that is mapped from validated label outputs and then combined with existing zone signals using a transparent formula. The recomputation function explicitly weights original zone priority, LLM-derived plausibility, zone-size penalties, and optional RL-derived scores, producing a bounded priority via a sigmoid transformation.

This module is also designed to be safe under imperfect information. When the labeler fails or returns low-confidence results, Zone QA defaults plausibility to a neutral score rather than producing extreme penalties or boosts. This conservative behavior reflects ethical guidance for minimizing harm when automated systems operate on sensitive data, especially where false confidence could misdirect search resources [12,14].

4 LLM Prompting System and Template Governance

The large language model (LLM) prompting architecture in Guardian is intentionally designed as a reliability-oriented mechanism rather than a mere usability or abstraction layer. This design choice aligns with the prevailing research consensus that language models exhibit superior performance when assigned narrowly scoped, well-specified roles governed by explicit output contracts and constraints [5,6,16].

Fig. 4. Prompt Governance and Template-Based LLM Interaction
In Guardian, a prompt is treated as a first-class system artifact: it is the explicit, inspectable interface through which an investigative narrative is transformed into a bounded task request for an LLM backend. This framing is essential because missing-person investigations depend on converting noisy, incomplete narratives into structured, auditable intelligence under time pressure [23,24].

To achieve this, Guardian distinguishes three operational prompt types: task prompts, consensus prompts, and format-guard prompts. Task prompts generate primary artifacts such as investigator summaries, schema-aligned entity extractions, and weak labels for movement and risk (Figure 4). Consensus prompts (or referee prompts) are invoked only when candidate outputs disagree or violate structural requirements, triggering adjudication, tie-breaking, or repair. Format-guard prompts are not a separate execution stage but a recurring design pattern in which prompts embed explicit contracts, such as “return JSON only,” fixed key sets, enumerated label spaces, and constrained bullet schemas. These contracts ensure that outputs remain machine-actionable, comparable across models, and stable across repeated runs [18].

Beyond the orchestrator-driven pathway, Guardian includes role-specific local modules with prompts optimized for robustness in single-model contexts and for model-specific chat formats. The local summarizer prompt is intentionally minimal and imperative, requiring exactly five short bullets and forbidding commentary. This strict prompt contract, combined with deterministic post-processing that extracts bullet-like lines and enforces a hard stop, ensures stable summaries even when models emit extraneous text. The extractor module supports multiple prompt styles aligned with different reliability strategies.
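Enforcing a format-guard contract deterministically might look like the following. The key names and the closed label space here are hypothetical illustrations, not Guardian's actual schema; the point is that any violation raises, so the pipeline can route the output to repair or referee prompts.

```python
import json

# Hypothetical contract for a weak-labeling task: fixed key set and a
# closed label space, mirroring the format-guard pattern described above.
REQUIRED_KEYS = {"movement", "risk", "rationale"}
MOVEMENT_LABELS = {"stationary", "local", "regional", "long_range"}

def enforce_contract(raw_output):
    """Parse a model reply against the JSON-only contract; raise on any
    violation (json.JSONDecodeError or ValueError) so the caller can
    escalate to deterministic recovery or a repair prompt."""
    obj = json.loads(raw_output)            # contract: return JSON only
    if set(obj) != REQUIRED_KEYS:
        raise ValueError(f"key set mismatch: {sorted(obj)}")
    if obj["movement"] not in MOVEMENT_LABELS:
        raise ValueError(f"label outside enum: {obj['movement']!r}")
    return obj
```

Because validation is code rather than another model call, repeated runs either pass identically or fail identically, which is what makes cross-model comparison meaningful.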
Weak-labeling prompts adopt a similar dual strategy of explicit constraints and model-aligned formatting. The weak-labeler module includes a chat-formatted strict JSON prompt with explicit system and user role headers and an exact JSON example.

The consensus layer introduces a distinct family of prompts whose purpose is not to generate primary artifacts from narratives but to reconcile candidate outputs produced by other models under strict constraints. These prompts fall into three principal categories: JSON repair prompts, referee adjudication prompts, and explicit tie-break prompts. Repair prompts are invoked only after deterministic parsing and recovery strategies fail. Referee adjudication prompts are used when multiple candidates are structurally valid yet disagree on material fields. These prompts instruct the adjudicator to select between candidate values or merge compatible ones while explicitly prohibiting invention of facts not present in the candidates. Tie-break prompts further narrow the adjudicator's scope by focusing on a single contested field and requiring JSON-only output with enumerated labels and a brief rationale. Summarization referee prompts adopt an additional efficiency strategy by using compact narrative prefixes and fixed five-line bullet schemas, reserving token budget for structured output while maintaining a stable contract for normalization.

Taken together, the prompting system in Guardian is best understood as an ecosystem of governed templates, controlled construction sites, and role-specific constraints that collectively enable deterministic validation and multi-model consensus.

5 QLoRA-Based Fine-Tuning Integration

Guardian integrates QLoRA-based fine-tuning to improve role-specific performance while preserving scalability and multi-model flexibility.
QLoRA enables role specialization by updating only a small set of low-rank adapter parameters on top of a quantized base model, preserving much of the base model's general language competence while making training feasible on modest hardware [8]. The fine-tuning workflow is designed around curated training inputs that reflect Guardian's operational tasks.

A key design decision is that fine-tuned models are integrated as interchangeable backends behind the same interface. This allows the pipeline to treat a “fine-tuned Qwen extractor” and a “fine-tuned Llama extractor” as peers generating candidates that are subsequently adjudicated by the consensus LLM. Such integration preserves the core principle that reliability is achieved through consensus, while fine-tuning improves candidate quality and reduces the burden on repair and re-ranking mechanisms. This is particularly consistent with cautionary findings that LLMs' evaluation and judging capabilities can be limited. Accordingly, Guardian uses fine-tuned specialists to produce better candidates and uses structured consensus and validation to decide what is safe to accept [16].

6 Evaluation

We have deployed Guardian as a distributed Google Cloud configuration consisting of three separate GPU virtual machines (VMs), each dedicated to a single task role: an extractor VM, a summarizer VM, and a weak-labeler VM. On each VM, two models run concurrently: a Qwen2.5-3B-Instruct model and a Llama-3.2-3B-Instruct model. Each model is served via a Dockerized vLLM server exposing an OpenAI-compatible API. Operationally, the Qwen server binds to port 8001 and the Llama server binds to port 8002 on each VM, producing six total inference servers running in parallel across the three machines. This separation of concerns allows each VM to be tuned for its workload while supporting consistent invocation semantics across roles.
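The role/port layout above (Qwen on 8001, Llama on 8002, one VM per role) can be captured as a small request-builder sketch. The hostnames are illustrative placeholders, not the paper's actual hosts, and actually sending the request is left to a transport layer; the URL shape follows the OpenAI-compatible chat completions convention that vLLM exposes.

```python
# Role-to-host and model-to-port mapping from the deployment described
# above. Hostnames are hypothetical; ports follow the paper (8001/8002).
ROLE_HOSTS = {
    "extractor": "extractor-vm",
    "summarizer": "summarizer-vm",
    "weak_labeler": "labeler-vm",
}
MODEL_PORTS = {
    "qwen2.5-3b-instruct": 8001,
    "llama-3.2-3b-instruct": 8002,
}

def chat_request(role, model, narrative):
    """Build the URL and OpenAI-compatible chat payload for one of the
    six vLLM servers; both models on a role VM get the same narrative,
    so their candidates can be compared by the consensus layer."""
    url = f"http://{ROLE_HOSTS[role]}:{MODEL_PORTS[model]}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": narrative}],
        "temperature": 0.0,  # deterministic candidates aid agreement scoring
    }
    return url, payload
```

Keeping invocation semantics identical across roles is what lets the backend abstraction treat all six servers as interchangeable transports.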
The outputs of the individual models are integrated within a consensus layer implemented using Gemini 2.5 Flash/Pro. This layer is employed for consensus formation, adjudication among model outputs, and automated JSON structure repair (via Vertex AI). It resolves conflicts between predictions generated by Qwen and Llama, corrects malformed or nonconformant JSON responses, and can optionally assist in the normalization phase. It is not involved in initial summarization, information extraction, or weak-label generation. Figures 5 and 6 present representative JSON output excerpts produced by the Llama LLM and by the Gemini-based consensus LLM, respectively.

Fig. 5. Excerpt from the Llama LLM JSON Output
Fig. 6. Excerpt from the Gemini Consensus LLM JSON Output

In the present study, reliability is assessed operationally rather than through a single aggregate metric. At the structural level, we examine whether outputs are parseable, schema-aligned, and repairable when malformed. At the factual level, we compare extraction and weak-label outputs against structured synthetic case ground truth where available. At the system level, we consider agreement across candidate models, the need for referee intervention, fallback behavior, and the extent to which the pipeline continues to produce stable, auditable outputs under disagreement and failure.

Guardian's evaluation emphasizes reliability, structural validity, and failure-mode behavior, rather than benchmark-style predictive accuracy. This focus reflects both the safety-critical nature of missing-child investigations and the well-documented lack of complete ground truth in real-world disappearance cases, particularly during early-stage response [11,23,24]. In such contexts, prior work argues that the primary risk is not marginal predictive error but the propagation of unstable or unsupported inferences into operational decision-making [2,4].
Accordingly, the evaluation is framed as a qualitative, diagnostic analysis of pipeline behavior under realistic operating conditions, grounded in direct inspection of live model outputs produced during system execution.

The evaluation corpus consists of synthetic and semi-structured missing-child case narratives processed end-to-end through the Guardian LLM Pipeline. Synthetic data is used to introduce controlled variation while avoiding exposure of sensitive real cases. This is consistent with established practices for developing and stress-testing AI systems in domains with privacy and availability constraints [20,25]. For each case, the pipeline produces three primary artifacts: a concise investigator-facing summary, a schema-aligned structured extraction, and a weak-label output capturing movement and risk signals. Outputs are examined at both the candidate level (individual model responses) and the consensus level (final canonical outputs after normalization, validation, and adjudication). This two-level inspection aligns with prior research showing that understanding how errors arise in intermediate representations is essential for reliable downstream analytics in unstructured document pipelines [5,18].

Inspection of raw single-model outputs reveals several recurring failure modes that directly motivate Guardian's architecture. In weak labeling, individual models frequently exhibit overconfident classification, assigning high-risk or long-range movement labels even when narrative evidence is ambiguous or incomplete. This tendency is consistent with recent findings that LLMs often produce confident categorical outputs despite uncertainty, particularly when asked to act as implicit judges without explicit constraints [6,16].
In summarization, models often generate readable but speculative statements, such as inferred offender intent or implied vehicle usage not explicitly supported by the narrative, a pattern that echoes longstanding concerns about narrative interpretation in child-abduction contexts, where unsupported assumptions can distort risk perception [2,19]. In extraction tasks, malformed or partially invalid JSON outputs appear regularly, especially when narratives are long or noisy, reflecting broader challenges in unstructured document analysis where schema compliance cannot be assumed without explicit validation [3,18].

When Guardian's normalization and consensus mechanisms are applied, outputs become systematically more conservative and structurally consistent. Normalization enforces closed vocabularies, fixed schemas, and deterministic parsing rules, collapsing free-form language into representations suitable for downstream processing and analysis [1]. The centralized consensus layer further reduces volatility by comparing candidate outputs across models, suppressing unsupported fields, and reconciling disagreements only within the bounds of observed evidence. This approach operationalizes insights from weak supervision and multi-source labeling research, which show that aggregating multiple noisy signals under explicit constraints yields more reliable supervision than trusting any single source [22,28]. In cases where models disagree, consensus outputs consistently favor interpretations that are explicitly grounded in the narrative, rather than those implied by model confidence alone, aligning with recent critiques of unconstrained LLM self-evaluation and judging [16].

Although this evaluation does not claim quantitative improvements in investigative outcomes or recovery rates, it demonstrates that Guardian reliably converts unstructured, narrative-driven inputs into auditable, schema-aligned artifacts that remain stable across repeated runs.
These properties are essential for downstream geospatial modeling, clustering, and search-zone generation, which depend on consistent structured inputs rather than stylistically variable text [9,15,17]. In this sense, the evaluation supports the central thesis of this work: in missing-child and other safety-adjacent domains, reliability must be enforced at the system level through validation, normalization, and consensus, rather than assumed from individual model outputs or generative confidence [12,14].

7 Discussion

Guardian's architecture treats reliability as a systems property. The primary contribution is not any single model's accuracy, but the end-to-end design in which multi-model candidate generation is forced through validation and consensus before outputs are accepted. This approach is particularly appropriate for missing-person and child-safety contexts, where narratives are incomplete and high-stakes decisions must remain auditable. The consensus layer operationalizes a pragmatic view of LLMs: they are powerful pattern extractors and summarizers, but they should be constrained, cross-checked, and integrated as components within a larger verification pipeline rather than treated as authoritative. This is consistent with research arguing that LLMs are often most effective as structured annotators that generate useful supervision and intermediate artifacts, especially when human oversight and validation mechanisms are present [6].

The QLoRA integration demonstrates how parameter-efficient fine-tuning strengthens candidate quality without undermining scalability. By adapting fewer than 1% of parameters, Guardian can train role-specific specialist models while preserving the feasibility of inference across multiple models and roles.
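The "fewer than 1% of parameters" figure is easy to sanity-check with back-of-envelope arithmetic: each LoRA-adapted weight matrix gains two low-rank factors, A (d_in x r) and B (r x d_out). The specific dimensions below (a 3B-parameter base, rank 16, four attention projections over 28 layers, hidden size 3072) are assumptions for illustration, not the paper's reported configuration.

```python
def lora_trainable_fraction(base_params, adapted_matrices, d_in, d_out, rank=16):
    """Estimate the trainable fraction under LoRA: each adapted matrix
    contributes rank * (d_in + d_out) new parameters (factors A and B)."""
    trainable = adapted_matrices * rank * (d_in + d_out)
    return trainable / base_params

# Assumed setup: 3B base, rank-16 adapters on the four attention
# projections of 28 layers, square 3072 x 3072 projections.
frac = lora_trainable_fraction(
    base_params=3_000_000_000,
    adapted_matrices=4 * 28,
    d_in=3072, d_out=3072, rank=16,
)
assert frac < 0.01  # consistent with "fewer than 1% of parameters"
```

Under these assumptions the adapters total roughly 11M parameters, about 0.4% of the base, which is comfortably inside the claimed bound.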
The observed training dynamics (monotonic reductions in loss, stabilizing gradient norms, and well-behaved learning-rate schedules) indicate that the models are internalizing the constrained behaviors required for operational deployment. Within a consensus-first framework, these improvements have multiplicative downstream effects: higher-quality candidate outputs reduce the need for post hoc repair, diminish ambiguity during adjudication, and increase the proportion of instances in which models concur, thereby enhancing system stability and lowering overall computational cost.

The reproducibility and deployment design highlights a practical lesson: multi-model consensus requires not only algorithmic strategies but also robust infrastructure. By distributing roles across dedicated GPU VMs, serving models through standardized APIs, and tunneling endpoints securely to the coordinator, Guardian makes multi-model inference both scalable and operationally manageable. This infrastructure approach supports privacy and safety expectations by minimizing exposure and enabling controlled access, consistent with broader concerns about responsible AI use in sensitive investigative domains [12, 14].

7.1 Limitations

We note the primary limitations of our system. First, Guardian's consensus strategy increases computational cost relative to single-model pipelines because each task is replicated across multiple models by design. While orchestration and parallel execution mitigate latency, resource consumption remains a tradeoff. This design is intentional: operational stability is prioritized over efficiency.

A second limitation concerns ground-truth scarcity in missing-person contexts. Weak labels and extraction outputs can be validated structurally, but semantic correctness can be difficult to confirm without authoritative datasets.
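Structural validation of the kind that remains possible without semantic ground truth (and in the spirit of Guardian's Zone QA checks on geospatial outputs) amounts to schema and range checking. The sketch below is illustrative only; the field names and bounds are assumptions, not Guardian's schema.

```python
# Assumed required fields for a hypothetical search-zone record.
REQUIRED_FIELDS = {"zone_id", "lat", "lon", "radius_km"}

def validate_zone(zone: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the zone passes."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - zone.keys())]
    lat, lon = zone.get("lat"), zone.get("lon")
    # Coordinates must be numeric and within geographic bounds.
    if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
        problems.append(f"latitude out of range: {lat}")
    if isinstance(lon, (int, float)) and not -180 <= lon <= 180:
        problems.append(f"longitude out of range: {lon}")
    if isinstance(zone.get("radius_km"), (int, float)) and zone["radius_km"] <= 0:
        problems.append("non-positive search radius")
    return problems

print(validate_zone({"zone_id": "Z1", "lat": 36.85, "lon": -76.29,
                     "radius_km": 5}))  # []
print(validate_zone({"zone_id": "Z2", "lat": 123.0,
                     "lon": -76.29}))  # flags missing radius, bad latitude
```

Checks like these verify conformance without any claim about semantic correctness, which is exactly the boundary the limitation above describes.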
Guardian therefore emphasizes auditable traces and conservative acceptance policies, but empirical evaluation will remain constrained by data availability and the sensitivity of investigative records [23, 24].

A third limitation is that LLMs, even when fine-tuned, may be unreliable at complex spatial reasoning tasks without explicit geospatial algorithms. Guardian addresses this by restricting LLM roles to extraction, summarization, and weak labeling, and by validating geospatial outputs through Zone QA rather than trusting generative reasoning [26, 27].

8 Conclusion

This paper described the Guardian LLM Pipeline, a consensus-routed, multi-model architecture designed to produce reliable, schema-aligned outputs for missing-person intelligence workflows. The system's primary entry-point module coordinates end-to-end execution, while the orchestrator module enables scalable multi-model parallelism and the backend module standardizes heterogeneous providers behind consistent interfaces. Reliability is centered in the consensus module, which compares multiple model outputs, resolves disagreements through structured conflict-resolution strategies, and applies repair and verification to produce stable canonical results. Robustness is strengthened by the Zone QA module, which detects and corrects zone-level and structural issues before and after consensus evaluation. QLoRA-based fine-tuning improves candidate quality using curated datasets composed of synthetic cases, real cases, and research-derived corpora, while updating less than 1% of model parameters to preserve scalability.

Taken together, these design choices support a pragmatic view of reliability for safety-adjacent LLM systems: dependable behavior arises from well-bounded roles, explicit checking, and centralized agreement, rather than from trusting any single model on its own.

References

1. Aggarwal, C.C.: Data Mining: The Textbook. Springer (2015)
2.
Allen, E.E.: Keeping children safe: Rhetoric and reality. Juvenile Justice Journal 5(1) (May 1998). U.S. Department of Justice, Office of Juvenile Justice and Delinquency Prevention
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media (2009)
4. Budowle, B., Baker, L., Sajantila, A., Mittelman, K., Mittelman, D.: Prioritizing privacy and presentation of supportable hypothesis testing in forensic genetic genealogy investigations. BioTechniques 76(9), 425–431 (2024)
5. Chau, M., Xu, J.J., Chen, H.: Extracting meaningful entities from police narrative reports. Journal of the American Society for Information Science and Technology 53(11), 984–995 (2002)
6. Chen, R., Qin, C., Jiang, W., Choi, D.: Is a large language model a good annotator for event extraction? In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-24), pp. 17772–17780 (2024)
7. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, 4th edn. (2022)
8. Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient finetuning of quantized large language models (2023). arXiv preprint arXiv:2305.14314
9. Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press (1996)
10. Ewers, R., Anderson, J., Thomson, D.: Agent-based predictive probability density mapping for search and rescue (2024). Manuscript
11. Federal Bureau of Investigation: Child Abduction Response Plan: An Investigative Guide. U.S. Department of Justice, 3rd edn. (2014)
12. Floridi, L., Cowls, J.: A unified framework of five principles for AI in society. Harvard Data Science Review 1(1) (2019)
13.
Hashimoto, A., Heintzman, L., Koester, R., Abaid, N.: An agent-based model reveals lost person behavior based on data from wilderness search and rescue. Scientific Reports 12, 5873 (2022)
14. International Committee of the Red Cross: Balancing risks and opportunities: New technologies and the search for missing people. Tech. rep., ICRC (2025)
15. Jiang, W., Luo, J.: Graph neural network for traffic forecasting: A survey. Expert Systems with Applications 207, 117921 (2022)
16. Li, T., Qin, Y., Sheng, O.R.L.: A multi-task evaluation of LLMs' processing of academic text input (2025). arXiv preprint arXiv:2508.11779
17. Lyu, F.: Evaluating the evaluation matrices: Integrating spatial assessment in geospatial AI model training and evaluation. In: I-GUIDE Forum 2025 (2025)
18. Mahadevkar, S.V., Patil, S., Kotecha, K., Soong, L.W., Choudhury, T.: Exploring AI-driven approaches for unstructured document analysis and future horizons. Journal of Big Data 11, 92 (2024)
19. Miller, J.M., Kurlychek, M., Hansen, J.A., Wilson, K.: Examining child abduction by offender type patterns. Justice Quarterly 25(3), 523–543 (2008)
20. Nikolenko, S.I.: Synthetic data for deep learning (2019). arXiv preprint arXiv:1909.11512
21. Papić, V., Šarić Gudelj, A., Milan, A., Miličević, M.: Person mobility algorithm and geographic information system for search and rescue missions planning. Remote Sensing 16(4), 670 (2024)
22. Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment 11(3), 269–282 (2017)
23. Ruiz Reyes, J., Congram, D., Sirbu, R.A., Floridi, L.: Where are they? A review of statistical techniques and data analysis to support the search for missing persons and the new field of data-based disappearance analysis. Forensic Science International 376, 112582 (2025)
24.
Solaiman, K.M.A., Sun, T., Nesen, A., Bhargava, B., Stonebraker, M.: Applying machine learning and data fusion to the "missing person" problem. IEEE Computer 55(6), 40–55 (2022)
25. Sun, C., van Soest, J., Dumontier, M.: Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy. Journal of Biomedical Informatics 143, 104404 (2023). https://doi.org/10.1016/j.jbi.2023.104404
26. Truong, T.H., Lau, J.H., Qi, J.: Understanding the geospatial reasoning capabilities of LLMs: A trajectory recovery perspective (2025). arXiv preprint arXiv:2510.01639
27. Xu, L., Zhao, S., Lin, Q., Chen, L., Luo, Q., Wu, S., Ye, X., Feng, H., Du, Z.: Evaluating large language models on spatial tasks: A multi-task benchmarking study (2025). arXiv preprint arXiv:2408.14438
28. Zeng, Z., Ni, W., Fang, T., Li, X., Zhao, X., Song, Y.: Weakly supervised text classification using supervision signals from a language model. In: Findings of NAACL 2022, pp. 2295–2305 (2022)