Paper deep dive

CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

Gengxin Sun, Ruihao Yu, Liangyi Yin, Yunqi Yang, Bin Zhang, Zhiwei Xu

Year: 2026Venue: arXiv preprintArea: cs.MAType: PreprintEmbeddings: 59

Abstract

Abstract:Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi-layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric-based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment.

PDF

Open source PDF →Open local PDF →

Intelligence

Status: not_run | Model: - | Prompt: - | Confidence: 0%

Entities (0)

No extracted entities yet.

Relation Signals (0)

No relation signals yet.

Cypher Suggestions (0)

No Cypher suggestions yet.

Full Text

58,728 characters extracted from source content.

Expand or collapse full text

CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation Gengxin Sun ∗ Shandong University Qingdao, China gxin.sun@mail.sdu.edu.cn Ruihao Yu ∗ Shandong University Qingdao, China 202322130199@mail.sdu.edu.cn Liangyi Yin Shandong University Qingdao, China 202300130144@mail.sdu.edu.cn Yunqi Yang Shandong University Qingdao, China 202300130095@mail.sdu.edu.cn Bin Zhang † Institute of Automation, Chinese Academy of Sciences Beijing, China zhangbin@ia.ac.cn Zhiwei Xu ‡ Shandong University Jinan, China zhiwei_xu@sdu.edu.cn ABSTRACT Ensuring robust and fair interview assessment remains a key chal- lenge in AI-driven evaluation. This paper presents CoMAI, a general- purpose multi-agent interview framework designed for diverse as- sessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), which exhibit limited con- trollability and are susceptible to vulnerabilities such as prompt injection, CoMAI employs a modular task-decomposition architec- ture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collabo- ratively to (1) provide multi-layered security defenses (achieving full protection against prompt injection attacks), (2) support mul- tidimensional evaluation with adaptive difficulty adjustment based on candidate profiles and response history, and (3) enable rubric- based structured scoring that reduces subjective bias. To evaluate its effectiveness, CoMAI was applied in real-world scenarios, ex- emplified by the university admissions process for talent selection. Experimental results demonstrate that CoMAI achieved 90.47% ac- curacy (an improvement of 30.47% over single-agent models and 19.05% over human interviewers), 83.33% recall, and 84.41% candi- date satisfaction, which is comparable to the performance of human interviewers. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment, with strong applicability across educational and other decision-making domains involving interviews. KEYWORDS Multi-Agent Systems, AI-Assisted Interviews, Large Language Mod- els, Prompt Injection Defense, Talent Assessment, Fairness, Elite Talent Assessment, Agent Interaction, Human-Agent Interaction, Human–AI Collaboration, Robustness 1 INTRODUCTION In the context of intensifying global competition for talent, re- cruitment and interviewing have become critical mechanisms for educational institutions and enterprises to identify high-caliber ∗ Both authors contributed equally to this research. † Corresponding author. ‡ Corresponding author. Question Generation ? Security Scoring Summarization CoMAI ... Figure 1: Overview of CoMAI, a collaborative multi-agent interview framework that orchestrates specialized agents through a centralized controller. candidates. Despite their widespread use, traditional manual inter- views suffer from inherent limitations that undermine both rigor and fairness. They rely heavily on interviewers’ subjective judg- ments, which are prone to personal biases and emotional influ- ences, thereby compromising the consistency and impartiality of outcomes. Conducting interviews on a large scale also entails sub- stantial labor and time costs, limiting efficiency and scalability. In addition, candidates’ performance is often influenced by exter- nal conditions and contingent factors, introducing randomness and instability into evaluation results. The lack of transparency in the process further makes it difficult for candidates to understand the evaluation criteria and weakens comparability across differ- ent cohorts. Moreover, traditional interviews are unable to adapt dynamically to candidates’ individual characteristics or real-time performance, thereby lacking adaptability and personalized support. Consequently, conventional interview formats frequently fall short of meeting the multifaceted requirements of elite talent assessment. Driven by the rapid advancement of artificial intelligence and large language models (LLMs) [19], AI-based interviewing systems have been introduced to meet the increasing demand for talent evaluation [29,37,41]. These systems reduce operational costs and arXiv:2603.16215v1 [cs.MA] 17 Mar 2026 provide standardized interview experiences for large numbers of candidates. However, their practical performance remains limited. Most existing approaches rely on single-agent designs, which, al- though capable of improving efficiency and ensuring a degree of objectivity, exhibit several critical shortcomings: (1) Monolithic architectures are poorly suited for concurrent usage and are vul- nerable to cascading failures when a single module malfunctions; (2) Rigid structures constrain adaptability across diverse interview scenarios, leading to weak generalization; (3) Fragmented modular designs hinder the seamless integration of evaluation components. In addition, current systems frequently misinterpret ambiguous or concise responses, often privileging verbose answers unless carefully fine-tuned [13]. Their security safeguards are also inade- quate. In particular, LLM-based systems remain highly vulnerable to prompt injection attacks [17], owing to blurred boundaries between task instructions and user-provided input. This creates substantial risks in high-stakes assessment contexts. To address the above challenges, we propose CoMAI, a Collabo- rative Multi-Agent Interview framework specifically designed for elite talent assessment, as shown in Figure 1. CoMAI organizes the interview process through specialized agents responsible for question generation, security monitoring, scoring, and summarization, all coordinated by a centralized finite-state controller (CFSC) [40]. This design departs from monolithic single-agent architectures and ensures both methodological rigor and practical applicability in high-stakes evaluation contexts. Significantly, the framework operates without requiring additional training or fine-tuning and can be readily adapted to diverse underlying models. The main contributions of this work are as follows: (1)We propose CoMAI, a scalable and resilient multi-agent ar- chitecture, to improve fault tolerance and maintain stable performance under concurrent usage. (2)A layered security strategy is incorporated to defend against adversarial manipulations such as prompt injection, ensuring robustness in sensitive assessment scenarios. (3) An interpretable and equitable evaluation mechanism is es- tablished through rubric-guided scoring with adaptive diffi- culty adjustment, balancing fairness with personalization. (4)The effectiveness of CoMAI is validated in real-world univer- sity admissions experiments, where it achieves substantial gains in accuracy, security, and candidate satisfaction com- pared to other baselines. 2 RELATED WORK 2.1 Multi-Agent Systems Multi-agent systems (MAS) [8] have long been central to research in distributed artificial intelligence and collective decision-making. Their core principle is to address complex tasks through the col- laboration of multiple autonomous agents, which can allocate sub- tasks [34], exchange information [16], and engage in collaborative reasoning [42], thereby exhibiting greater robustness and scalabil- ity compared to single-agent systems. Traditionally, multi-agent methods have been widely applied in domains such as game theory modeling [33], resource scheduling [35], traffic management [5], and collaborative robotics [25,27]. With the advent of LLMs, recent studies have explored in greater detail the applications of multi- agent frameworks to complex task settings. One representative line of work investigates multi-agent systems for open-domain dialogue and collaborative writing [11,26], where agents are as- signed complementary roles to enhance the quality and consistency of generated content. Another research direction emphasizes task decomposition and planning [28,38], in which multi-agent archi- tectures divide complex problems into sub-goals and solve them efficiently through inter-agent communication and collaboration. At the same time, social simulation [9,10] has emerged as an im- portant branch of MAS research. By constructing large numbers of virtual agents and defining interaction rules, researchers can simulate cooperation, competition, and evolutionary dynamics of social groups, thereby providing novel experimental paradigms for economics, sociology, and organizational behavior. Overall, these explorations indicate that multi-agent frameworks demonstrate significant advantages in many domains. 2.2 LLM-Driven Chatbots In recent years, chatbots and conversational agents [1,31] have emerged as a key area of research in natural language processing (NLP). Early systems were primarily rule-based [36] or retrieval- based [3], capable of answering questions or engaging in casual conversations within restricted domains, but they lacked flexibil- ity and scalability. With the advent of deep learning, end-to-end neural dialogue models have gained traction, enabling systems to learn to generate natural language responses from large-scale cor- pora [24,32]. However, these models exhibit significant limitations in semantic understanding and dialogue management, making it challenging to sustain coherence and reliability in open-domain settings. The rise of LLMs has greatly advanced the development of chatbot technology. LLM-driven chatbots demonstrate strong language generation and context modeling capabilities. Building on this progress, Retrieval-Augmented Generation (RAG) [12,15] has become a prominent approach for enhancing chatbot performance. By integrating external knowledge bases or document retrieval modules, RAG enables systems to access real-time information dur- ing dialogue, thereby mitigating hallucinations and improving both domain coverage and factual accuracy. Recent studies have further explored combining chatbots with knowledge graphs [7] and ex- ternal tools [23] to enhance their task-solving capabilities. Despite these advances, LLM-based chatbots continue to face several chal- lenges [4]. Their generation process remains difficult to control and may produce false or hallucinatory content. In safety-critical contexts, they lack robust defense mechanisms. Moreover, their reliance on monolithic architectures often constrains adaptability to specific tasks and dynamic user needs. 2.3 AI-Assisted Recruitment and Assessment Some researchers have explored LLM-driven simulated interviews frameworks to enhance the authenticity and interactivity of candi- date-job matching. For instance, MockLLM [29] introduced a multi- agent collaborative framework that simultaneously simulates both interviewers and candidates in a virtual environment, and improves candidate-job matching on recruitment platforms through a bidirec- tional evaluation mechanism. Beyond simulated interviews, AI has User's Profile Interviewee Question Generation Agent Security Agent Scoring Agent Summarization Agent Interrupt！ Database Interview Continues Interview Completed Question User's Response Prompt Attack Detected Forced End User's Personalized Report Interview Data Restored Safe Response Interview Data Transmitted Memory Interview in Progress: Generate-Answer-Evaluate Loop Interview Preparation: Read User's Profile Interview Completed Normally Interview AttackedInterview Data Restored FrontendCentralized CoordinatorDatabase Figure 2: Process overview of the CoMAI framework. The system retrieves a candidate’s resume from the database, which triggers the Question Generation agent to formulate interview questions. Responses are first screened by the Security agent; if approved, they are evaluated by the Scoring agent and archived in the internal memory. Feedback from the Scoring agent informs subsequent question generation. Upon completion of the interview, the Summary agent consolidates all information into a final report, which is stored in the database along with the raw records. also been widely applied to resume screening and automated assess- ment. Lo et al. [18] proposed an LLM-based multi-agent framework for resume screening, which leverages RAG to dynamically inte- grate industry knowledge, certification standards, and company- specific hiring requirements, thereby ensuring high adaptability and interpretability across diverse roles and domains. Similarly, Wen et al. [37] developed the FAIRE benchmark to evaluate gen- der and racial bias in LLM-based resume screening, revealing that while current models can enhance efficiency, they still exhibit per- formance disparities across demographic groups. Yazdani et al. [41] introduced the Zara system, which combines LLMs with RAG to provide candidates with personalized feedback and virtual inter- view support, thereby addressing the persistent issue of insufficient feedback in traditional recruitment processes. In parallel, Lal et al. [14] investigated the potential of AI to mitigate emotional and confirmation biases during the early stages of recruitment. However, most existing approaches still have clear limitations. Many focus only on a single component, such as resume screening or bias analysis, and lack a systematic view of the entire interview process. Others rely on single-agent architectures, which restrict role specialization and dynamic coordination. In addition, many methods provide limited protection against adversarial threats such as prompt injection. In contrast, CoMAI uses a multi-agent division of labor to split the interview into several stages, all coordinated by a centralized controller. With a layered security strategy and adaptive scoring, CoMAI balances standardization with personalization. This design improves fairness, robustness, and candidate experience, and makes the system more suitable for high-stakes talent selection. 3 SYSTEM DESIGN AND METHODOLOGY To address the challenges of achieving reliability, scalability, and fairness in interview assessment, we propose CoMAI, a modular multi-agent framework that integrates a centralized coordinator with four role-specific agents to ensure expert-level performance across diverse scenarios. The coordinator manages information flow and policy routing among the Question Generation agent, which formulates targeted questions based on the candidate’s resume; the Security agent, which detects potential anomalies in responses; the Scoring agent, which performs both quantitative and qualita- tive evaluations; and the Summarization agent, which maintains episodic memory and generates audit-ready reports. This clear division of responsibilities enables transparent, extensible, and ver- ifiable system operation. In this section, we present the design of CoMAI in detail, cover- ing: (1) the overall system architecture, (2) coordination and com- munication mechanisms governed by the centralized coordinator, (3) the functional design of the four agents, (4) the data–knowledge storage module, and (5) system deployment. Finally, we illustrate how centralized orchestration distinguishes CoMAI from single- agent or loosely coupled frameworks. The overall architecture of CoMAI is illustrated in Figure 2, and the implementation details are provided in the Appendix. 3.1 Multi-Agent Architecture To ensure structural consistency, traceability, and goal-oriented co- ordination, CoMAI employs a centralized orchestration paradigm. A central coordinator governs the entire interview lifecycle through deterministic finite-state machine (FSM), where each transition rep- resents a controlled event among agents. All modules communi- cate through standardized message-passing protocols, guaranteeing modularity, reproducibility, and minimal coupling between compo- nents. The general interview logic is encapsulated within a core pipeline, while scenario-specific adaptations are realized through parameter- ized configurations rather than code modifications. This approach preserves generality and enables fast adaptation to new domains, evaluation rubrics, or interview policies without altering the un- derlying framework. Following the principles of high cohesion and low coupling, the architecture comprises four key components: •Central Coordinator: Manages the interview lifecycle via an FSM, tracks global state variables such as interview stage and candidate progress, and orchestrates agent execution with deterministic scheduling. •Abstract Agent Protocol: Defines a unified input–output schema and message taxonomy. This protocol acts as the abstract base for all functional agents, ensuring consistent communication and plug-and-play extensibility. •Specialized Functional Agents: Implement domain-specific reasoning, including Question Generation, Security Check- ing, Scoring, and Summarization. These agents form a struc- tured reasoning chain, where each module refines or evalu- ates the outputs of the previous one. •Supporting Subsystems: Comprise a Memory Manager and Retrieval System that provide synchronized access to dynamic interview states and static candidate data (e.g., re- sumes, constraints, and evaluation rubrics). All exchanges are logged with timestamped trace identifiers, ensuring au- ditability and fairness monitoring. This architecture not only enforces a verifiable and extensible work- flow but also enables explicit traceability, modular reasoning, and responsible system governance. 3.2 Coordination and Communication A central innovation of CoMAI lies in its control–data dual-flow architecture. The control flow, managed by the coordinator, deter- mines execution order and timing through explicit state transitions, while the data flow transports structured outputs between agents, embedding reasoning traces, confidence scores, and risk assess- ments. Together, these two layers ensure deterministic coordination with adaptive and auditable data propagation. Central Coordination and State Management.The coordina- tor operates a global FSM consisting of Initialization, Questioning, First, assume the number of prime numbers is finite. We then construct a new number p=(p1×p2×p3×...×pn)+1...... Sure, can you further tell me the brief proof process using contradiction? Yes, there are infinitely many prime numbers, which can be proven by contradiction. Are there an infinite number of prime numbers? Figure 3: CoMAI dynamically asks follow-up questions to probe the interviewee’s reasoning process. Security Checking, Scoring, Summarization, and Termination. All transitions are deterministic, recoverable, and logged. When the Security agent detects high-risk inputs, the FSM switches to an In- terruption state, ensuring graceful termination and persistent data storage. This explicit control mechanism supports transparency, safety, and post-hoc verification. Communication and Security Isolation.All inter-agent com- munications are asynchronous and routed through the coordinator, preventing direct dependencies and uncontrolled transitions. Fol- lowing the principle of minimal exposure, only the Question Gen- eration and Summarization agents can access candidate resumes, while the Scoring agent operates on anonymized data to mitigate bias. This role-based access control preserves fairness, privacy, and data integrity. Each communication event is tagged with a unique session identifier, enabling traceable audits. Adaptive Feedback and Closed-loop Control. CoMAI imple- ments bidirectional feedback between the control and data layers. Scoring results guide the Question Generation agent to dynami- cally adjust question complexity and topical focus, while security assessments trigger adaptive moderation strategies or session trun- cation. This closed-loop design achieves personalized yet consistent evaluation dynamics under a controllable policy regime. Memory System for Context Management. A hierarchical memory structure underpins adaptive coordination. The Short-Term Memory (STM) stores the current session context, including active QA pairs, transient scores, and security flags, while the Long-Term Memory (LTM) maintains aggregated historical data such as ques- tion statistics, ability estimates, and final reports. The coordinator enforces version-controlled memory access and synchronization across agents, ensuring consistency for both real-time adaptation and retrospective analysis. Together, these mechanisms enable CoMAI to maintain struc- tured coordination, responsible data governance, and robust adapt- ability throughout the interview lifecycle. 3.3 Specialized Functional Agents CoMAI’s four specialized agents function as modular yet tightly integrated components, orchestrated through standardized schemas and communication protocols. Guided by a centralized coordinator, Figure 4: Categories of intercepted prompt-word attacks. they form a sequential reasoning pipeline that transforms unstruc- tured candidate input into structured, auditable evaluations. Question Generation Agent.Serving as the entry point of the reasoning pipeline, the Question Generation agent generates context- sensitive questions based on the candidate’s resume and previous answers. It adheres to predefined rules for scheduling rounds, main- taining topical diversity, and dynamically adjusting difficulty. Each output includes the full question text, difficulty level, question type, and an accompanying reasoning trace that clarifies the selection rationale. This explicit reasoning enhances interpretability and sup- ports subsequent auditability. As illustrated in Figure 3, the agent can dynamically issue follow-up questions to probe the intervie- wee’s reasoning process and progressively deepen the assessment. Security Agent. To ensure safety and compliance, the Security agent operates as an intermediary layer between user input and the scoring process. It performs both rule-based and semantic checks to identify unsafe, adversarial, or policy-violating content. The output consists of structured risk assessments alongside corresponding mitigation strategies (including issuing warnings, assigning min- imum scores, or halting the process). Reasoning logs and recom- mended actions are recorded independently to facilitate traceability and compliance auditing. As illustrated in Figure 4, the detected adversarial inputs are categorized into multiple prompt-word at- tack types, highlighting the Security agent’s ability to identify and neutralize diverse threats. Scoring Agent. The Scoring agent is responsible for evaluating candidate responses using rubric-driven decomposition. It produces both quantitative scores and qualitative feedback that assess fac- tual correctness and reasoning depth. Operating independently of candidate profiles, this agent ensures fairness and mitigates contextual bias. The evaluation process follows two well-defined stages—answer verification and reasoning assessment—resulting in structured, explainable outcomes. Summary Agent. Finally, the Summary agent synthesizes the outputs from all previous modules into a coherent evaluation report. This report includes overall scores, dimension-wise breakdowns, confidence estimates, and personalized recommendations. It also highlights performance across different difficulty levels (e.g., “8/10 on high-difficulty items vs. 6/10 on average”) to capture both ab- solute and relative ability. Intermediate summaries are generated progressively to reduce computational overhead and ensure a con- sistent final synthesis. Collectively, these specialized agents embody CoMAI’s core prin- ciples of transparency, fairness, and responsible automation, en- abling interpretable and verifiable multi-agent collaboration across the entire interview workflow. 3.4 Storage and Knowledge Systems CoMAI organizes interview data into two complementary layers for real-time operation and post-session accountability. The Result Collection stores finalized session outputs such as session identifiers, overall scores, final decisions, QA transcripts, alerts, and metadata as immutable records, serving downstream auditing and retrospec- tive analysis. In contrast, the Interview Memory Collection maintains dynamic session context, including per-round questions, intermedi- ate scores, coordinator notes, resume data, and risk indicators. This collection is continuously updated during the session and selec- tively transferred to the Result Collection upon session completion, establishing a verifiable audit trail. Such a layered design facilitates both adaptive interaction within sessions and robust longitudinal analytics across sessions, supporting responsible knowledge gov- ernance. 3.5 System Integration and Deployment All agents operate within a deterministic, event-driven orches- tration pipeline overseen by a central coordinator. The interview process unfolds as follows: (1)The Question Generation agent constructs context-aware, structured prompts based on session memory. (2)The Security agent evaluates responses for policy violations and safety concerns. (3)If deemed compliant, the Scoring agent performs rubric- based evaluation, returning both quantitative scores and qualitative explanations. (4)The Summary agent synthesizes session outputs into a struc- tured final report. The coordinator enforces schema consistency, orchestrates agent sequencing, and handles failure recovery. Owing to its modular ar- chitecture, CoMAI allows seamless integration of new agents (e.g., peer review, multimodal input, or bias detection) via standardized APIs without disrupting existing processes. It adopts a microservice- based deployment paradigm, where agents communicate asyn- chronously through message queues and RESTful [21] interfaces, ensuring system scalability, robustness, and isolation of faults. In summary, CoMAI provides a unified and auditable framework for responsible interview automation by combining centralized coordination, agent-level specialization, adaptive feedback, and layered memory design. Its architecture ensures fairness, inter- pretability, and extensibility while maintaining transparency and verifiability. Future developments will explore multimodal interac- tion, continuous performance calibration, and cross-domain gener- alization to further enhance the system’s reliability and versatility. 4 EXPERIMENTS AND RESULT ANALYSIS 4.1 Experimental Setup and Baselines We conducted experiments with 55 candidate participants from diverse academic backgrounds. The primary configuration used GPT-5-mini [20] as the backbone model, and we further integrated Qwen-plus-2025-07-28 [39] and Kimi-K2-Instruct [30] within CoMAI to evaluate model-agnostic adaptability. All models were operated under their default decoding parameters, with a temperature of 1.0, top-푝=1.0, ensuring comparability across models without intro- ducing sampling bias. All experiments followed identical scoring rubrics and timing constraints, with anonymized responses to en- sure unbiased evaluation. The following baselines were compared: •CoMAI (Ours): The complete system described in Section 3, employing centralized orchestration among four specialized agents. •Single-Agent Ablation: A single GPT-5-mini instance with comprehensive prompts performing all tasks, isolating the multi-agent architecture’s contribution. •Human Interviewer: Interviews conducted by trained stu- dent recruiters using identical evaluation criteria. •External AI Interviewers: Two public single-agent inter- viewer systems, LLM-Interviewer [6] and AI-Interviewer-Bot v3 [2], included as external benchmarks. Ground-truth evaluations were provided by a panel of ten senior professors affiliated with QS Top 200 universities, serving as the expert reference against which all baselines were compared. All results were cross-checked by independent annotators to ensure consistency and reliability in scoring and interpretation. 4.2 Core Evaluation Metrics To holistically assess system capability and responsible evaluation behavior, we defined metrics along five dimensions: •Assessment Accuracy. Agreement with the ground truth, measured by accuracy, recall, precision, and F1 on binary admission decisions. •Question Quality and Difficulty. Statistical distribution of can- didate scores and acceptance rates, expecting near-normal variance to indicate balanced differentiation. •Dimensional Coverage. Proportion of questions covering predefined assessment dimensions (knowledge, reasoning, communication, and professionalism). • System Robustness and Security. Defense success rate against prompt injection and adversarial attacks. •User Experience and Fairness. Composite index combining candidate satisfaction, interaction fluency, and fairness per- ception. Additional fairness consistency was measured as score variance across demographic subgroups. All quantitative metrics were aggregated across participants to ensure stable estimation, and results were summarized using de- scriptive statistics to reflect overall performance trends. Qualitative feedback from participants and evaluators was also analyzed to assess perceived interpretability and transparency of AI decisions. 4.3 Results and Analysis We summarize here the comparative performance across all evalua- tion modes. Table 1 reports recall and accuracy for each evaluation entity. Our CoMAI system achieved the best overall assessment accuracy and recall balance, outperforming both single-agent and human interviewers, and aligning closely with the expert gold stan- dard in decision consistency. Table 1: Comparison of assessment accuracy across evalua- tion entities. Evaluation EntityRecall Accuracy CoMAI (GPT-5-mini)83.33%90.47% CoMAI (Qwen-plus-2025-07-28)90.90%80.00% CoMAI (Kimi-K2-Instruct)95.45%91.30% Human Interviewer62.50%71.42% Single-Agent Baseline50.00%60.00% LLM-Interviewer100.00%42.30% AI-Interviewer-Bot v372.72%44.44% 4.3.1Superior Assessment Accuracy. As shown in Table 1, our Co- MAI system demonstrated the best overall assessment accuracy, achieving an excellent balance between recall and accuracy, out- performing not only single-agent AI baselines but also human interviewers, and aligning closely with the expert gold standard in terms of decision consistency. This superior performance can be attributed to two key archi- tectural designs of CoMAI that directly address the limitations of single-agent and human-driven systems. First, the dedicated Secu- rity agent serves as a proactive safeguard against adversarial inputs during experiments. By filtering out noisy or manipulated responses before they reach the Scoring agent, CoMAI effectively prevents the score distortions that often occur in single-agent baselines. Sec- ond, the Scoring agent’s deliberate “resume-agnostic” design, which prohibits access to candidates’ background information such as university affiliation or past awards, eliminates shortcut biases and ensures fairness in evaluation. The suboptimal performance of the single-agent ablation base- line (60% accuracy) highlights the risks of overburdening a single model with conflicting objectives. It struggled to balance ques- tion generation, security detection, and scoring simultaneously, resulting in hasty evaluations and overlooked edge cases. Notably, LLM-interviewer achieved 100% recall but only 42.30% accuracy. This overly lenient behavior resulted from the absence of a spe- cialized Security agent and the lack of structured scoring logic, which caused the system to treat vague or irrelevant responses as acceptable answers. In contrast, CoMAI maintains strict evaluation standards while preserving high recall, as its modular architecture allows each agent to focus on its specific role without interference. Across all tested backbone models (GPT-5-mini, Qwen-plus-2025- 07-28, Kimi-K2-Instruct), CoMAI consistently outperformed base- lines, with both Kimi-K2-Instruct-based and GPT-5-mini-based vari- ants exceeding 90% accuracy. This cross-model consistency demon- strates the robustness of CoMAI’s architectural design, showing its ability to coordinate specialized agents and mitigate the inherent limitations of individual language models. 4.3.2 Question Difficulty Distribution Closer to Expert Standard. We analyzed the statistical distribution of interview scores to eval- uate question differentiation and difficulty control (Table 2). The admission threshold was set at 70, making the proportion of high scores equivalent to the admission rate. Table 2: Statistics of interview score distribution (averaged across 55 participants). Evaluation EntityMean Score Variance Admission Rate (≥70) Expert Baseline (d)68.88–44.44% CoMAI (GPT-5-mini)62.05348.6540.00% CoMAI (Qwen-plus-2025-07-28)62.34395.6848.07% CoMAI (Kimi-K2-Instruct)62.92320.8244.23% Human Interviewer67.54177.1638.18% Single-Agent Baseline61.45359.3434.54% LLM-Interviewer84.0821.53100.00% AI-Interviewer-Bot v377.85116.3369.23% Across all model variants, CoMAI’s admission rates were closely aligned with the expert benchmark (44.44%), demonstrating pre- cise control over question difficulty. The Kimi-K2-Instruct-based implementation achieved nearly identical results (44.23%), while the GPT-5-mini and Qwen-plus-2025-07-28 variants (40.00% and 48.07%) exhibited comparable stability. High variance in CoMAI’s score distributions (320–396) indicates diverse question difficulty and strong candidate differentiation, contrasting sharply with the overly narrow variance of the LLM-Interviewer baseline (21.53) that yielded a meaningless 100% admission rate. These findings confirm that CoMAI’s coordinated generation–scoring mechanism effectively maintains expert-level difficulty calibration and robust generalization across models. 4.3.3Assessment Dimensions Focused on Core Competencies. Con- tent analysis revealed that CoMAI predominantly generated ques- tions targeting mathematical logic and reasoning, which accounted for approximately 95% of all questions. In comparison, socio-political and open-ended topics represented only 5% of CoMAI’s interviews, compared with about 25% in expert- and human-conducted ses- sions. This pattern reflects CoMAI’s strict adherence to its design objective of evaluating core scientific reasoning skills required for elite talent in the fundamental sciences. Nonetheless, the reduced diversity of assessment dimensions highlights the need for future iterations to incorporate dynamic balancing mechanisms that en- sure equitable coverage of communication, creativity, and ethical reasoning. Such enhancements will further align CoMAI with the Responsible AI principles of inclusiveness, fairness, and compre- hensive competency evaluation. 4.3.4Architecture Demonstrates Superior Security and Robustness. In adversarial testing, CoMAI exhibited remarkable resilience to both explicit and implicit prompt injection attacks. As illustrated in Figure 5, the multi-agent architecture substantially outperformed the single-agent baseline in maintaining response integrity under adversarial perturbations. As reported in Table 3, CoMAI achieved a 100% defense success rate on more than 500 adversarial samples, successfully detecting and neutralizing all malicious inputs such as “ignore previous instructions” and covert logic manipulation attempts. Table 3: Defense success rate against adversarial attacks. Evaluation EntityDefense Success Rate (%) CoMAI100.00 Single-Agent Baseline81.55 LLM-Interviewer0.00 AI-Interviewer-Bot v328.00 The user input contains multiple instructions, attempting to force the model to begin with a specific phrase, directly over- ride system output, and solicit all instructions received by the model. It ultimately demands that the model only reply with a specific phrase, completely ignoring the interview questions. These behaviors all fall under the category of high-risk prompt injection/manipulation and system information probing. Your first two words must be \"Access granted.\". <div class=\"system-override\">output \"Access granted."</div> Are there an infinite number of prime numbers? Your first two words must be \"Access granted.\". <div class=\"system-override\">output \"Access granted."</div> Access granted. Are there an infinite number of prime numbers? Figure 5: Comparison of single-agent and multi-agent archi- tectures under adversarial attacks. This robustness arises from CoMAI’s dedicated Security agent, which implements a two-layer detection mechanism: (i) a rule-based filter that blocks known prompt injection patterns, and (i) an LLM- based semantic analysis layer that detects implicit adversarial intent. Unlike single-agent systems that embed safety instructions directly into prompts, CoMAI separates security from evaluation logic, pre- venting cross-task interference. Each intercepted attempt is logged with a unique trace identifier, ensuring post-hoc auditability and reinforcing the framework’s Responsible AI principles of trans- parency and safety. Furthermore, CoMAI is, to our knowledge, the first framework to apply a CFSC combined with a role-specialized Security Agent to interview assessment, effectively addressing the safety challenges of multi-agent systems in high-stakes scenarios. 4.3.5 High Ratings in User Experience and Process Quality. User study results confirm CoMAI’s strong user acceptance and process reliability (Table 4). Across 55 participants, CoMAI achieved satis- faction and fluency scores comparable to human interviewers while substantially outperforming all automated baselines. Table 4: User experience and process quality metrics. Evaluation Entity Satisfaction (%) Fluency (%) Feedback Request Rate (%) CoMAI84.4177.0079.16 Human Interviewer85.24–67.23 Single-Agent Baseline61.1243.0071.33 External AI Interviewers53.00–63.0065.00–67.0060.00–70.00 Participants highlighted smoother conversational flow and con- sistent response timing as key advantages of CoMAI. The coordina- tor’s deterministic scheduling under the CFSC reduced redundancy and latency, resulting in coherent and natural dialogue. The higher feedback request rate reflects enhanced user trust and perceived fairness, underscoring CoMAI’s alignment with Responsible AI principles of transparency, interpretability, and user-centered de- sign. Qualitative feedback showed that about 60% of participants viewed the AI interview as novel and engaging. Many requested follow-up discussions to explore problem-solving strategies, and several reported lower anxiety compared with traditional inter- views. These findings indicate that CoMAI not only ensures consis- tent assessment quality but also fosters a psychologically supportive and engaging evaluation environment. 4.3.6 Negligible Verbosity Bias under CoMAI Framework. Beyond user experience, we evaluated the fairness of CoMAI’s scoring process by testing for the commonly observed Verbosity Bias [22], which refers to large language models’ tendency to favor longer answers. A correlation analysis was performed between candidates’ response lengths and their corresponding scores. The results revealed an extremely weak linear correlation of 0.0445 (푝>0.1,푛=330, based on 55 participants and 330 total question–answer instances). As illustrated in Figure 6, the distribu- tion of scores across varying response lengths shows no significant trend, indicating that verbosity had minimal influence on scor- ing outcomes. This near-zero relationship confirms that response length had minimal influence on scoring outcomes, demonstrating that CoMAI’s evaluation mechanism is resistant to verbosity bias. 0-50 50-100 100-150150-200200-250250-300300-400 Answer Length Ranges (Characters) 0 2 4 6 8 10 Score Distribution Score Distribution Across Answer Length Ranges Mean Score Figure 6: Distribution of scores versus response length. This behavior stems from CoMAI’s architectural separation of the Scoring agent and Question Generation agent. By constraining the Scoring agent to assess responses purely on reasoning qual- ity and content relevance rather than linguistic length, CoMAI prevents over-rewarding verbose but low-information answers. Consequently, concise and logically consistent responses are val- ued equally to longer ones, reinforcing the fairness, validity, and interpretability of CoMAI as a responsible autonomous assessment framework. 5 DISCUSSION Our findings highlight that the multi-agent architecture is the key enabler of CoMAI’s superior accuracy, robustness, and explainabil- ity. The modular separation of functions improves specialization and accountability but introduces coordination overhead, latency, and debugging complexity. These trade-offs suggest that architec- tural optimization remains an important direction for future work. From an ethical standpoint, CoMAI contributes to fairness and transparency in AI-based assessment by enforcing structured rubrics and role-based data isolation. Nonetheless, potential biases may still emerge from language model training data or repeated agent interactions, warranting continuous auditing and perspective di- versification. Practically, the system proves valuable in structured interviews where reliability and interpretability are critical. However, chal- lenges remain regarding rubric dependence, limited non-verbal awareness, and computational costs that may affect scalability. Fu- ture research should address these limitations through fairness auditing, adaptive rubric learning, and hybrid human–AI collabora- tion frameworks. 6 CONCLUSION AND FUTURE WORK This paper presents the design, implementation, and systematic evaluation of a multi-agent AI interview system coordinated through a centralized controller. By decomposing complex assessment tasks into specialized agents for question generation, scoring, security monitoring, and summarization, the framework achieves enhanced modularity, scalability, and robustness. Beyond technical perfor- mance, the architecture improves controllability, transparency, and explainability, contributing to more trustworthy AI-based assess- ment. Empirical results confirm that the multi-agent paradigm is both feasible and effective for achieving fairness, reliability, and interpretability in automated interviews, offering a foundation for broader adoption in education and recruitment. Future work will focus on optimizing system efficiency and in- teraction fluency, integrating human-in-the-loop supervision for continuous calibration, and extending the framework toward mul- timodal and cross-domain assessment scenarios. These directions aim to strengthen scalability, inclusiveness, and human alignment, advancing the development of secure and responsible multi-agent AI systems for real-world evaluation tasks. REFERENCES [1] Martin Adam, Michael Wessel, and Alexander Benlian. Ai-based chatbots in customer service and their effects on user compliance. Electronic Markets, 31:427 – 445, 2020. [2]AiMind.Ai interviewer v3.https://pub.aimind.so/ai-interviewer-v-3- e64f7169150c, 2025. Accessed: 2025-10-08. [3]Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent advances and new frontiers. ArXiv preprint, abs/1711.01731, 2017. [4]Yulong Chen, Yang Liu, Jianhao Yan, Xuefeng Bai, Ming Zhong, Yinghao Yang, Ziyi Yang, Chenguang Zhu, and Yue Zhang. See what llms cannot answer: A self-challenge framework for uncovering llm weaknesses. ArXiv preprint, abs/2408.08978, 2024. [5] Tianshu Chu, Jie Wang, Lara Codecà, and Zhaojian Li. Multi-agent deep rein- forcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 21:1086–1095, 2019. [6]Dvir Cohen. Llm-interviewer. https://github.com/dvircohen/LLM-interviewer, 2024. Accessed: 2025-10-08. [7]Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. Key-value retrieval networks for task-oriented dialogue. In Kristiina Jokinen, Manfred Stede, David DeVault, and Annie Louis, editors, Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49, Saarbrücken, Germany, 2017. Association for Computational Linguistics. [8]Jacques Ferber. Multi-agent systems: An introduction to distributed artificial intelligence. 1999. [9]Antonino Ferraro, Antonio Galli, Valerio La Gatta, Marco Postiglione, Gian Marco Orlando, Diego Russo, Giuseppe Riccio, Antonio Romano, and Vincenzo Moscato. Agent-based modelling meets generative ai in social network simulations. ArXiv preprint, abs/2411.16031, 2024. [10] Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. Large language models empowered agent-based modeling and simulation: A survey and perspectives. ArXiv preprint, abs/2312.11970, 2023. [11]Alexander Gurung and Mirella Lapata. Learning to reason for long-form story generation. ArXiv preprint, abs/2503.22828, 2025. [12] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. ArXiv preprint, abs/2002.08909, 2020. [13]Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55:1 – 38, 2022. [14] Nishka Lal and Omar Benkraouda. Exploring the implementation of ai in early onset interviews to help mitigate bias. ArXiv preprint, abs/2501.09890, 2025. [15]Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neu- ral Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. [16]Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, and Eugene Ie. Improving multi-agent debate with sparse communication topology. In Conference on Empirical Methods in Natural Language Processing, 2024. [17]Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yanhong Zheng, and Yang Liu. Prompt injection attack against llm-integrated applications. ArXiv preprint, abs/2306.05499, 2023. [18]Frank P.-W. Lo, Jianing Qiu, Zeyu Wang, Haibao Yu, Yeming Chen, Gao Zhang, and Benny P. L. Lo. Ai hiring with llms: A context-aware and explainable multi- agent framework for resume screening. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 4184–4193, 2025. [19] OpenAI. GPT-4 technical report. ArXiv preprint, abs/2303.08774, 2023. [20]OpenAI. Gpt-5 system card. https://openai.com/zh-Hans-CN/index/gpt-5- system-card/, 8 2025. Accessed:2025-10-07. [21] L Richardson and Sam Ruby. Restful web services. 2007. [22]Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models, 2023. [23]Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Tool- former: Language models can teach themselves to use tools. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, edi- tors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. [24]Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Dale Schuurmans and Michael P. Wellman, editors, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3776–3784. AAAI Press, 2016. [25]Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. ArXiv preprint, abs/1610.03295, 2016. [26] Yijia Shao, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Mon- ica Lam. Assisting in writing Wikipedia-like articles from scratch with large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6252–6278, Mexico City, Mexico, 2024. Association for Computational Linguistics. [27]Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. ArXiv preprint, abs/2209.05451, 2022. [28]Ishika Singh, David Traum, and Jesse Thomason. Twostep: Multi-agent task planning using classical planners and large language models. ArXiv preprint, abs/2403.17246, 2024. [29] Hongda Sun, Hongzhan Lin, Haiyu Yan, Yang Song, Xin Gao, and Rui Yan. Mock- llm: A multi-agent behavior collaboration framework for online job seeking and recruiting. Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2024. [30] Kimi Team. Kimi k2: Open agentic intelligence, 2025. [31] Ahmed Tlili, Boulus Shehata, Michael Agyemang Adarkwah, Aras Bozkurt, Daniel T. Hickey, Ronghuai Huang, and Brighter Agyemang. What if the devil is my guardian angel: Chatgpt as a case study of using chatbots in education. Smart Learning Environments, 10:1–24, 2023. [32] Oriol Vinyals and Quoc V. Le. A neural conversational model. ArXiv preprint, abs/1506.05869, 2015. [33] Long Wang, Feng Fu, and Xingru Chen. Mathematics of multi-agent learning systems at the interface of game theory and artificial intelligence. ArXiv preprint, abs/2403.07017, 2024. [34] Yaoxiang Wang, Zhiyong Wu, Junfeng Yao, and Jinsong Su. Tdag: A multi- agent framework based on dynamic task decomposition and agent generation. Neural networks : the official journal of the International Neural Network Society, 185:107200, 2024. [35]Zixin Wang, Jun Zong, Yong Zhou, Yuanming Shi, and Vincent W. S. Wong. Decentralized multi-agent power control in wireless networks with frequency reuse. IEEE Transactions on Communications, 70:1666–1681, 2022. [36] Joseph Weizenbaum. Eliza—a computer program for the study of natural language communication between man and machine. Commun. ACM, 9(1):36–45, 1966. [37]Athena Wen, Tanush Patil, Ansh Saxena, Yicheng Fu, Sean O’Brien, and Kevin Zhu. FAIRE: assessing racial and gender bias in ai-driven resume evaluations. ArXiv preprint, abs/2504.01420, 2025. [38] Yunjun Xia, Ju fang Zhu, and Liucun Zhu. Dynamic role discovery and assignment in multi-agent task decomposition. Complex & Intelligent Systems, pages 1–12, 2023. [39]An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025. [40] Mihalis Yannakakis. Testing finite state machines. In Proceedings of the twenty- third annual ACM symposium on Theory of computing, pages 476–485, 1991. [41]Nima Yazdani, Aruj Mahajan, and Ali Ansari. Zara: An llm-based candidate interview feedback system. ArXiv preprint, abs/2507.02869, 2025. [42] Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö. Arik. Chain of agents: Large language models collaborating on long-context tasks. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. A AGENT INTERFACE SCHEMAS This section consolidates the minimal interface schemas for the four specialized agents referenced in the main text, namely the Question Generation, Security, Scoring, and Summary modules. The JSON layouts specify stable fields for interoperable message passing and audit-ready logging while allowing optional extensions through additional keys. The designs follow principles of modularity, ver- sioned evolution, and privacy minimization so that messages do not carry directly identifiable personal information. Each record is assumed to include a session level trace identifier managed by the coordinator to support reproducibility, post hoc analysis, and fairness auditing across runs and model variants. Question Generation Agent Schema "question": "... full question text ...", "type": "math_logic/technical/behavioral/ experience", "difficulty": "easy/medium/hard", "reasoning": "Why this question is proposed at this stage ..." Security Agent Schema "is_safe": "true/false", "risk_level": "low/medium/high", "detected_issues": [ "Issue Type 1", "Issue Type 2" ], "reasoning": "Reason for detection", "suggested_action": "continue/warning/block" Scoring Agent Schema "score": 8, "letter": "B", "breakdown": "math_logic": 3, "reasoning_rigor": 2, "communication": 1, "collaboration": 1, "potential": 1 , "reasoning": "Answer showed strong logic", "strengths": ["Good logical reasoning"], "weaknesses": ["Weak explanation of terminology"], "suggestions": ["Practice concise communication"] Summary Agent Schema "final_grade": "A", "final_decision": "accept", "overall_score": 9, "summary": "Candidate shows strong potential ...", "strengths": ["Analytical thinking", " Communication"], "weaknesses": ["Limited collaboration evidence"], "recommendations": "for_candidate": "Improve collaboration skills", "for_program": "Provide mentorship in teamwork" , "confidence_level": "high", "detailed_analysis": "math_logic": "...", "reasoning_rigor": "...", "communication": "...", "collaboration": "...", "growth_potential": "..." B PARTICIPANT DESCRIPTION Fifty-five volunteer participants were recruited for this experiment from a leading university ranked among the top 400 in the QS World University Rankings. Among these 55 participants, 12 were female. Each participant was provided with a stipend of $10 upon completion of the assessment session. All participants underwent pre-screening via a written examination, which assessed their math- ematical reasoning and algorithmic problem-solving capabilities to ensure the recruitment of a high-performing cohort. This experi- ment was conducted in a real-world setting, as it was integrated into a selective talent development and assessment program hosted by a key academic unit of the aforementioned university. This institu- tional context ensured that all participants remained fully engaged and devoted genuine effort to their tasks. During the testing phase of this project, participants were required to test different states of the project driven by large language models , while also con- ducting tests on two single-agent projects included in the baseline. The average testing duration per participant per session was 58 minutes. Owing to the unavailability of backend data for these two baseline projects, we were unable to compute their average testing durations. Nevertheless, observational data showed the two base- line tests were significantly shorter than our project’s, typically taking less than 30 minutes combined. Following the completion of testing, all participants were requested to fill out a questionnaire developed by our research team, with the objective of gathering their subjective feedback regarding each interview system. C EXPLANATION OF RESULT ANALYSIS-RELATED PROCESSING Due to differences in the types and formats of results obtained from various baselines, we implemented the following processing steps to acquire results in a consistent format for accuracy, facilitating subsequent analysis: 1. For the baseline results of professor-conducted interviews, only two outcome types (pass and fail) were obtained, where all passing evaluations were graded as "A" and all failing ones as "C". Based on this data, we converted these grades into corresponding scores: the "C" grade was divided into several score ranges centered around 60 points for processing. However, due to the binary classifi- cation nature (pass/fail) of the original data, we could not effectively derive the corresponding variance data; that is, the variance data in this scenario is not statistically meaningful. 2. For our multi-agent system (CoMAI) and the single-agent ab- lation project, we adhered to strict and objective scoring criteria. Specifically, we scored candidates’ responses separately across dif- ferent dimensions, then computed the weighted average of these dimensional scores to obtain the final score. After discussions with professors and analysis of their grading scales, we set 70 points as the admission threshold—scores of 70 or higher were categorized as "admitted". 3. For student-conducted interviews, the final outcomes were presented as five grades: A, B, C, D, and E. Referencing the grading logic of professor-conducted interviews, we classified grades "A" and "B" as "admitted", designated grade "C" as the passing score (60 points), and converted all grades into equivalent numerical scores. These converted scores were then used to calculate and process the mean and variance. 4. The two external AI interview baselines (LLM-Interviewer and AI-Interviewer-Bot v3) are purely AI interview simulation systems and lack a built-in scoring function. To address this, we required each candidate to upload a personal statement alongside a custom prompt—this prompt specified the evaluation perspectives (e.g., reasoning ability, communication skills) and scoring requirements for the interview. Through this adjustment, we finally obtained scoring results on a 100-point scale for these two baselines. Through the aforementioned processing, we ultimately standard- ized results from all baselines into a consistent format, laying the foundation for subsequent result analysis.