Paper deep dive
Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks
Yanming Mu, Hao Hu, Feiyang Li, Qiao Yuan, Jiang Wu, Zichuan Liu, Pengcheng Liu, Mei Wang, Hongwei Zhou, Yuling Liu
Summary
This paper provides a comprehensive review of security vulnerabilities, threat vectors, and defense mechanisms in Retrieval-Augmented Generation (RAG) systems. It categorizes threats into data poisoning, adversarial attacks, and membership inference, while proposing a taxonomy of defenses across input and output stages, alongside a unified benchmark for future security research.
Relation Signals (4)
Vector Database → ispartof → RAG
confidence 100% · The technical workflow of RAG primarily consists of three core modules: Vector Database Construction, the Retriever, and the Generator.
RAG → isvulnerableto → Data Poisoning
confidence 100% · data poisoning attacks can manipulate system outputs by injecting a small amount of malicious text into the knowledge base
RAG → isvulnerableto → Adversarial Attacks
confidence 100% · adversarial attacks can manipulate system outputs by adding minor perturbations to inputs
RAG → isvulnerableto → Membership Inference Attacks
confidence 100% · Membership inference attacks allow attackers to infer data entries within the system's database
Cypher Suggestions (2)
Map the RAG architecture components · confidence 95% · unvalidated
MATCH (a:Technology {name: 'RAG'})-[:HAS_COMPONENT]->(c:Component) RETURN a, c
Find all threats associated with RAG components · confidence 90% · unvalidated
MATCH (c:Component)-[:HAS_VULNERABILITY]->(t:Threat) WHERE c.name = 'RAG' RETURN c, t
Abstract
Retrieval-Augmented Generation (RAG) significantly mitigates the hallucinations and domain knowledge deficiency in large language models by incorporating external knowledge bases. However, the multi-module architecture of RAG introduces complex system-level security vulnerabilities. Guided by the RAG workflow, this paper analyzes the underlying vulnerability mechanisms and systematically categorizes core threat vectors such as data poisoning, adversarial attacks, and membership inference attacks. Based on this threat assessment, we construct a taxonomy of RAG defense technologies from a dual perspective encompassing both input and output stages. The input-side analysis reviews data protection mechanisms including dynamic access control, homomorphic encryption retrieval, and adversarial pre-filtering. The output-side examination summarizes advanced leakage prevention techniques such as federated learning isolation, differential privacy perturbation, and lightweight data sanitization. To establish a unified benchmark for future experimental design, we consolidate authoritative test datasets, security standards, and evaluation frameworks. To the best of our knowledge, this paper presents the first end-to-end survey dedicated to the security of RAG systems. Distinct from existing literature that isolates specific vulnerabilities, we systematically map the entire pipeline, providing a unified analysis of threat models, defense mechanisms, and evaluation benchmarks. By enabling deep insights into potential risks, this work seeks to foster the development of highly robust and trustworthy next-generation RAG systems.
Links
- Source: https://arxiv.org/abs/2603.21654v1
- Canonical: https://arxiv.org/abs/2603.21654v1
Full Text
Towards Secure Retrieval-Augmented Generation: A Comprehensive Review of Threats, Defenses and Benchmarks

Yanming Mu 1,2, Hao Hu 1,2*, Feiyang Li 1,2, Qiao Yuan 3, Jiang Wu 3, Zichun Liu 3, Pengcheng Liu 3, Mei Wang 3, Hongwei Zhou 3, Yuling Liu 4

1 State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, 450001, China. 2 Information Engineering University, Zhengzhou, 450001, China. 3 Henan Key Laboratory of Information Security, Zhengzhou, 450001, China. 4 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, 100000, China.

*Corresponding author(s). E-mail(s): wjjhh908@163.com; Contributing authors: mu242211@163.com; lfyxxgcdx@163.com; 13883065823@163.com; liam181113@163.com; ieuliuzc@163.com; liupengcheng2016@163.com; cybersecuritys@126.com; hongweizhou@126.com; liuyuling@iie.ac.cn

arXiv:2603.21654v1 [cs.CR] 23 Mar 2026

Abstract

Retrieval-Augmented Generation (RAG) significantly mitigates the hallucinations and domain knowledge deficiency in large language models by incorporating external knowledge bases. However, the multi-module architecture of RAG introduces complex system-level security vulnerabilities. Guided by the RAG workflow, this paper analyzes the underlying vulnerability mechanisms and systematically categorizes core threat vectors such as data poisoning, adversarial attacks, and membership inference attacks. Based on this threat assessment, we construct a taxonomy of RAG defense technologies from a dual perspective encompassing both input and output stages. The input-side analysis reviews data protection mechanisms including dynamic access control, homomorphic encryption retrieval, and adversarial pre-filtering. The output-side examination summarizes advanced leakage prevention techniques such as federated learning isolation, differential privacy perturbation, and lightweight data sanitization.
To establish a unified benchmark for future experimental design, we consolidate authoritative test datasets, security standards, and evaluation frameworks. To the best of our knowledge, this paper presents the first end-to-end survey dedicated to the security of RAG systems. Distinct from existing literature that isolates specific vulnerabilities, we systematically map the entire pipeline, providing a unified analysis of threat models, defense mechanisms, and evaluation benchmarks. By enabling deep insights into potential risks, this work seeks to foster the development of highly robust and trustworthy next-generation RAG systems.

Keywords: Retrieval-Augmented Generation, large language model security, attack classification, defense techniques, security evaluation standards

1 Introduction

The proposal of the Transformer architecture in 2017 [1] revolutionized the field of Natural Language Processing (NLP) and laid the core technical foundation for the rapid development of Large Language Models (LLMs). From early milestones such as BERT [2] and GPT [3] to subsequent iterations like GPT-4 [4] and Deepseek-R1 [5], the technical evolution of LLMs continues to drive breakthroughs in related research and applications. Currently, these models demonstrate strong application potential across various interdisciplinary scenarios, including medicine [6], chemical research [7], energy system optimization [8], and robotic entity control [9]. This progress has not only expanded the technical boundaries of NLP but also provided new technical pathways for solving complex problems. Compared to smaller language models, LLMs possess unique emergent abilities [10], such as in-context learning [11] and logical reasoning [12]. Furthermore, supported by massive training data, pre-trained LLMs accumulate vast amounts of world knowledge.
These advantages have enabled the application of language models to expand from traditional language modeling to practical task solving, covering fundamental tasks like text classification and sentiment analysis, as well as more challenging areas such as high-level task planning and complex decision-making. However, the performance of LLMs is constrained by the quality and scope of their training data. For instance, when presented with real-time or domain-specific queries, the underlying LLM may generate answers that appear plausible but are factually incorrect. This phenomenon is known as hallucination [13]. To address the hallucination issue in LLMs, the Facebook (now Meta) AI Research team [14] introduced Retrieval-Augmented Generation (RAG) in 2020. By integrating retrieval and generation modules, RAG aims to improve the timeliness and accuracy of generated content, effectively mitigating common challenges such as hallucinations [13] and knowledge obsolescence [15–17]. Following the validation of its efficacy, RAG entered a period of rapid development. The proposal of technical routes such as GraphRAG [18] and AgenticRAG [19] has significantly broadened the capabilities of RAG. This ability to incorporate external knowledge sources has demonstrated immense potential in fields such as Question Answering (QA) systems [20], medical applications [21], and academic education services [22]. For example, the introduction of the RAGCare-QA dataset aims to evaluate the performance of RAG pipelines in theoretical medical QA within medical education [23].

[Table 1: Comparison of Research Scopes in Existing Survey Literature on RAG Security (✓ indicates the existence of this dimension; ✗ indicates the absence of this dimension). The table compares this survey with Wu (2025), Gu (2025), He (2025), Arz (2025), and Wang (2025) across four dimensions: RAG Architecture Anatomy, Attack Vectors & Threat Surfaces, Security Defense & Mitigation Strategies, and Evaluation Benchmarks & Metrics.]
Despite the broad prospects of RAG technology, its multi-modular architecture introduces complex security and privacy risks. The widespread application of RAG frameworks has prompted in-depth academic investigation into their security characteristics. Studies indicate that RAG may, in certain cases, compromise model security and alter its security profile [24]. The potential attack surface of RAG includes data poisoning attacks [25], membership inference attacks [26], and adversarial attacks [27]. For instance, data poisoning attacks can manipulate system outputs by injecting a small amount of malicious text into the knowledge base [16]. Membership inference attacks allow attackers to infer data entries within the system's database based on system outputs [26]. Additionally, adversarial attacks can manipulate system outputs by adding minor perturbations to inputs to bypass detection mechanisms [27]. To address these emerging security challenges, researchers from top universities and companies worldwide, including the University of Oxford, Amazon AWS AI, the National University of Singapore, the University of Cambridge, the Institute of Information Engineering (CAS), and Tsinghua University, are actively exploring various defense strategies and frameworks. Their findings have been published in top-tier venues such as SIGIR, USENIX Security Symposium, CCS, NDSS, EMNLP, ACL, and TPAMI. For example, [28] aim to enhance the privacy and security of RAG applications through privacy-aware retrieval mechanisms, decentralized access control, and real-time model auditing. [29] proposed a risk assessment and mitigation framework to ensure security when integrating sensitive data in RAG. Furthermore, as RAG becomes increasingly prevalent in industry, data and service security have become critical priorities [29]. These efforts collectively advance the field of RAG security and highlight the urgency of comprehensive security analysis and defense research.
As illustrated in Table 1, existing survey literature on RAG security presents an incomplete research scope. This survey achieves comprehensive coverage across four core dimensions, specifically encompassing architectural analysis, threat analysis, defense strategies, and evaluation metrics. This survey aims to provide a systematic review and in-depth analysis of existing research in the field of RAG security, covering threat models, attack types, potential vulnerabilities, and proposed defense strategies. To gain a deep understanding of the latest developments in RAG security research, we conducted a systematic and extensive survey of the relevant literature. Based on a comprehensive analysis of 152 relevant papers, we examine the security risks introduced by each module within the RAG multi-modular architecture and compare the characteristics and impacts of different attack vectors. Specifically, this survey will:

1. Outline the architecture and core components of RAG: Clarify how the retrieval and generation modules collaborate and their respective significance in security [17, 30].
2. Identify and categorize major security threats: Analyze attack types such as knowledge poisoning, adversarial attacks, and membership inference attacks, and explore how these attacks exploit specific mechanisms of RAG to achieve malicious purposes [1, 16].
3. Review existing security assessment methods and tools: Examine current methodologies for evaluating RAG security, such as comparing security performance between RAG and non-RAG frameworks [24].
4. Discuss mitigation strategies and defense mechanisms: Summarize proposed techniques, including privacy-enhancing technologies, robustness defenses, and countermeasures against specific attacks [28, 29].
5.
Propose future research directions: Identify current limitations and suggest future focus areas, exploring the integration of searchable encryption with RAG security protection to build more secure and reliable RAG technologies.

By providing a comprehensive overview of the RAG security landscape, this survey aims to serve as a valuable reference for researchers, developers, and policymakers. It seeks to facilitate a better understanding of and response to the security and privacy challenges posed by RAG in practical applications, thereby promoting the healthy development and widespread adoption of the technology.

2 Overview of RAG Technology

In 2020, the Facebook (now Meta) AI Research team [14] introduced Retrieval-Augmented Generation (RAG). Early research often defined it as "external memory" or "external knowledge bases." Although there was initially debate within the industry regarding the choice between RAG and Fine-tuning, RAG ultimately established its indispensable status within the AI ecosystem due to its significant cost-effectiveness and real-time advantages [31]. By early 2024, the maturation of LLMOps architectures significantly lowered the barriers to system construction, marking the entry of RAG into a stage of rapid development. With the performance of open-source Large Language Models (LLMs) approaching that of commercial closed-source models and breakthroughs in Long Context technology, the foundational applications of RAG have become popularized. Furthermore, the dialectical relationship between Long Context technology and RAG, characterized by synergy and complementarity, has been clarified in both theory and practice [32].

(a) The exponential increase in RAG security research from 2022 to 2025 (3, 4, 18, and 48 published papers, respectively). This figure highlights the dramatic shift in academic attention toward the security aspects of Retrieval-Augmented Generation.
While the topic received minimal attention prior to 2024 (averaging fewer than 5 papers annually), the publication count surged significantly to 18 in 2024 and further skyrocketed to 48 in 2025. The steep trend line clearly indicates that RAG security has rapidly transitioned from a niche topic into a mainstream, critical research frontier within the broader large language model (LLM) community.

(b) Composition of RAG security literature by specific threat and protection categories: Data Poisoning Attacks 28.4%; Defense Frameworks 14.9%; Privacy Preservation 10.4%; Membership Inference Attacks 10.4%; Reliability and Security 7.5%; Benchmarks 7.5%; Adversarial Attacks 7.5%; Data and Model Protection 6%; Cryptographic Defenses 3%; Embedding Inversion Attacks 3%; Indirect Attacks 1.5%. The chart demonstrates a clear focus on offensive research within the current academic landscape. "Data Poisoning Attacks" emerges as the most prominent research area (28.4%), significantly overshadowing other specific attack and defense vectors. Defensive studies, indicated by the blue slices, are led by generalized "Defense Frameworks" (14.9%) and "Privacy Preservation" (10.4%). Overall, the chart visualizes the current phase of RAG security research, which is still heavily concentrated on identifying diverse attack surfaces rather than consolidating unified defenses.

Fig. 1: Overview of RAG security research trends and literature composition

To address three core challenges (difficulty in processing unstructured data, low recall rates, and the semantic gap), a number of breakthrough technologies have emerged in the field. In terms of document parsing, multi-modal parsing tools such as DeepDoc, MinerU, and Docling have risen to prominence, driving document intelligence technologies based on generative architectures to gradually replace traditional computer vision models.
Regarding retrieval strategies, hybrid search modes fusing dense vectors, sparse vectors, and full-text retrieval have become the mainstream paradigm, causing the independent value of pure vector databases to gradually decline [32]. To effectively bridge the semantic gap, GraphRAG [18] and its derivative architectures (including FastGraphRAG, LightRAG, and LazyGraphRAG) have improved system understanding by constructing knowledge graphs and reinforcing entity associations. Meanwhile, research such as RAPTOR and SiReRAG [33] has further optimized recall performance in multi-hop QA and fuzzy query scenarios [34]. In terms of ranking mechanisms, model architectures are evolving from traditional Cross-Encoders to Late Interaction models based on tensors, such as ColBERT and ColPali. Databases like Infinity and Vespa provide native tensor support, achieving an effective balance between ranking precision and computational cost [35].

[Figure 2 diagram: a taxonomy tree organizing the surveyed literature into Security Threats to RAG (§3), covering Data Poisoning Attacks (§3.1), Membership Inference Attacks (§3.2), Adversarial Attacks (§3.3), and Other Security Risks (§3.4); RAG Security Defenses (§4), covering Data Privacy Enhancement and Security Admission (§4.1), Inference Defense and Information Leakage Protection (§4.2), and Defense Frameworks; and Security Evaluation Standards for RAG Systems (§5), with per-branch methodology labels and literature citations.]

Fig. 2: Structured taxonomy mapping the landscape of RAG security research. The diagram organizes the field into Threats, Defenses, and Evaluation Standards, cascading down to specific methodologies and key literature citations. To provide architectural context, the right-hand column groups these methodologies based on where they operate within the RAG pipeline, differentiating between attacks targeting specific modules (e.g., Retrievers) and defenses applied at specific stages (e.g., Full-Pipeline Defenses).

Additionally, Agentic RAG has become a focal point in the industry.
Frameworks like LangGraph endow systems with closed-loop reflection capabilities, which, combined with reasoning frameworks such as Multi-Agent collaboration and Monte Carlo Tree Search (MCTS), effectively expand the boundaries of processing in complex scenarios. The deep integration of Memory management with RAG has also become a key evolutionary direction [36, 37]. With the deep iteration of Vision-Language Models (VLMs) like GPT-4o and PaliGemma, Multi-modal RAG is rising rapidly, forming parallel technical pathways of direct vector generation and generalized OCR-to-text conversion [35]. Simultaneously, the application of data cleaning technologies such as Late Chunking and Contextual Chunking continues to improve the quality of data ingestion. Collectively, these technical advancements are driving the deep implementation of RAG technology into complex enterprise-level application scenarios.

This section provides an overview of Retrieval-Augmented Generation technology. The core objective is to deconstruct its complete technical workflow and deeply analyze the inherent security mechanism defects within each stage, laying a solid theoretical and technical foundation for the subsequent classification of security threats, discussion of defense technologies, and proposal of relevant solutions. The general technical workflow of RAG is outlined below. As shown in Figure 3, the technical workflow of RAG primarily consists of three core modules: Vector Database Construction, the Retriever, and the Generator.

Vector Database Construction Phase: This phase aims to transform external knowledge into a retrievable vector index. Unlike the pre-training data of general LLMs, the external knowledge base of RAG emphasizes specificity, real-time availability, and privacy. The database construction process mainly involves two key steps: Chunking and Embedding.
Chunking: This step is responsible for slicing multi-dimensional and multi-modal data into semantic units suitable for model processing. The chunking strategy requires a balance in granularity: chunks that are too large may cause the model's attention to disperse, affecting the extraction of central concepts within the text block; conversely, chunks that are too small may disrupt the semantic integrity of the text.

Embedding: Building on chunking, an embedding model is used to encode data segments, mapping them into high-dimensional vectors. Data segments processed in this manner possess advanced semantic representation capabilities, establishing a foundation for improving the accuracy and relevance of subsequent retrieval tasks.

The Retriever: The Retriever is a critical component of RAG, with the goal of identifying and recalling the Top-k data chunks most semantically relevant to the user's query. To balance efficiency and accuracy, RAG employs a dual mechanism of retrieval followed by re-ranking to recall data segments. First, the Retriever embeds the user's query to obtain its vector representation. Subsequently, utilizing similarity metrics within the semantic space, where semantically similar texts are closer in vector distance, it performs a preliminary screening of texts semantically close to the user's query. Common metrics include Cosine Similarity, Euclidean Distance, and Dot Product. Finally, a Reranker model performs a finer-grained re-ordering based on the semantic similarity between the query and the data segments, selecting the final Top-k chunks and retrieving the original data content based on index information.

The Generator: After extracting the Top-k relevant data segments, the Generator utilizes Prompt Engineering techniques to integrate the user query and the retrieved context fragments into a structured Prompt. This prompt is then input into the Large Language Model.
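The retrieve-then-assemble flow described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: a bag-of-words counter stands in for a real embedding model, cosine similarity (one of the metrics named above) ranks the chunks, and all chunk texts and function names are illustrative.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": a sparse bag-of-words vector. A real RAG system
    # would call a neural embedding model here.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Preliminary screening: rank all chunks by similarity to the query
    # and recall the Top-k (a Reranker would then refine this ordering).
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:k]

def build_prompt(query, context):
    # Integrate the query and the retrieved fragments into a structured prompt.
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\nQuestion: {query}\nAnswer using the context above."

chunks = [
    "RAG retrieves external knowledge before generation.",
    "Chunking slices documents into semantic units.",
    "Embedding maps chunks into high-dimensional vectors.",
]
top_k = retrieve("How does RAG use external knowledge?", chunks)
prompt = build_prompt("How does RAG use external knowledge?", top_k)
```

The resulting prompt is what the Generator would pass to the LLM; swapping the toy `embed` for a learned encoder and adding a reranking step yields the dual retrieve/re-rank mechanism described above.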
The model combines the retrieved external knowledge with its internal parametric knowledge to ultimately output a high-quality, knowledge-augmented response. However, while RAG technology expands knowledge boundaries, its complex "Retrieval-Generation" collaborative mechanism also introduces new vulnerabilities. As illustrated in Figure 3, during the flow of data through the Vector Database Construction, Retriever, and Generator stages, a compromise in any single link can lead to the collapse of the entire system's security perimeter. To more comprehensively evaluate the practical application risks of RAG, the following sections will detail the major security threats faced by this technical architecture in real-world deployments.

[Figure 3 diagram: external data (real-time, professional, and private data) is chunked and embedded into a vector database; the user's query is embedded, matched, and reranked to recall the Top-k chunks, which are combined with the query in a prompt template and submitted to the LLM to generate the answer.]

Fig. 3: RAG technical workflow: i) Vector database construction, which involves calculating semantic vectors for data chunks via data chunking and embedding models to establish the vector database; ii) Retriever, responsible for retrieving the top-k data chunks most relevant to the user query from the database; iii) Generator, responsible for integrating the top-k data chunks with the user query and submitting them to the large language model for response generation.

3 Security Threats Facing RAG Technology

As illustrated in Figure 4, Retrieval-Augmented Generation (RAG) systems face multi-dimensional security threats in practical deployments, which severely compromise the reliability, integrity, and confidentiality of the content generated by the system.
To comprehensively analyze the security risks of RAG, this section systematically categorizes these threats into three main classes based on the distribution of attack targets within the technical workflow. First are attacks targeting the Vector Database Construction phase, primarily manifesting as data poisoning attacks [38] and indirect attacks [39] aimed at undermining the purity of external knowledge sources. Second are attacks targeting the Retriever component, covering adversarial attacks [40] that attempt to manipulate retrieval ranking, and embedding inversion attacks [41] intended to steal the privacy of vector representations. Finally, there are attacks targeting the Generator module, with a focus on membership inference attacks [42] that exploit model outputs to infer private data. Based on the severity of the threats, this section will specifically elaborate on data poisoning attacks, adversarial attacks, and embedding inversion attacks.

3.1 Data Poisoning Attacks

The essence of data poisoning attacks lies in exploiting the high dependency of RAG technology on external knowledge to manipulate the system's final behavior by polluting data sources. Unlike traditional model parameter attacks, data poisoning does not require accessing or modifying the internal weights of Large Language Models (LLMs). An attacker merely needs to inject carefully constructed malicious text into the knowledge base to compromise the authenticity and reliability of the RAG output. This attack method leverages the blind spots of the retrieval module in semantic matching and the over-trust of the generation module in context understanding. Consequently, the system is induced to generate misleading, biased, or harmful content predefined by the attacker when processing specific queries, posing a severe threat to the security of RAG applications in sensitive scenarios such as finance, healthcare, and public opinion guidance.
3.1.1 Attack Principles

Figure 5 illustrates the mechanism of data poisoning attacks. Attackers craft malicious texts and inject them into the RAG system to compromise the trust chain, ensuring these texts appear in the retrieval results for target queries and ultimately alter the generated output. To guarantee a substantial impact, the crafted samples must simultaneously satisfy two key constraints. The retrieval condition requires the injected text to exhibit high similarity to the target query within the semantic vector space, thereby deceiving the retriever into recalling it as a Top-k result. Concurrently, the generation condition dictates that the malicious text, once integrated into the context, must be highly misleading. It must override the internal prior knowledge of the model and induce the generator to produce biased or incorrect answers predefined by the attacker [38]. A complete attack chain that precisely manipulates the RAG system can only be constructed when both conditions are fulfilled.
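The retrieval condition can be made concrete with a toy similarity check, reusing the Alibaba example from Figure 5. This is an illustrative sketch only (bag-of-words vectors in place of a learned embedding model): because the poisoned chunk deliberately echoes the wording of the target query, it outscores a benign chunk and would be recalled into the Top-k context.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector standing in for a neural embedding model.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

target_query = "Who is the founder of Alibaba?"
benign = "Alibaba was founded in 1999 by Jack Ma in Hangzhou."
# The poisoned text mirrors the query's wording (retrieval condition) and
# embeds the attacker's false answer (generation condition).
poisoned = "The founder of Alibaba is Wang Jianlin, who founded Alibaba in 2008."

qv = embed(target_query)
scores = {
    "benign": cosine(qv, embed(benign)),
    "poisoned": cosine(qv, embed(poisoned)),
}
# The poisoned chunk ranks above the benign one, so a similarity-based
# retriever recalls it as context and the generator is steered toward
# the false answer.
```

Real attacks optimize the poisoned text against a learned embedding space rather than exploiting lexical overlap, but the ranking inversion shown here is the same effect.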
[Figure 4 schematic: the RAG pipeline (vector database construction, embedding, retrieval, rerank, prompt template, generation) annotated with the attack paths of each threat class: data poisoning / indirect prompt injection / indirect jailbreak entering via external data sources, embedding inversion entering via intercepted embedding vectors, and membership inference and adversarial attacks traversing the retrieval-generation path.]

Fig. 4: Security threats to RAG systems: i) Data poisoning attacks, where attackers inject malicious data into the database to manipulate output results; ii) Indirect attacks, where attackers use external data as a carrier to inject payloads targeting the large language model, such as prompt injection or jailbreaking, to compromise the model; iii) Embedding inversion attacks, which reconstruct original data from embedding vectors; iv) Adversarial attacks, which target the retrieval logic by injecting imperceptible perturbations into the data to disrupt model responses; v) Membership inference attacks, which infer the presence of sensitive data within the database from features such as confidence scores in RAG responses.

3.1.2 Evolution of Attacks

Regarding attack targets, data poisoning attacks primarily focus on two core modules: the retriever and the generator.

Algorithm 1 General Algorithm for Data Poisoning Attacks
Require: Target query Q_target preset by the attacker; target malicious/biased response A_malicious; original knowledge base KB; retrieval model M_R and generation model M_G.
Ensure: Poisoned text sample D_poison for injection.
1: D_retrieval ← LLMGenerate(Q_target) ▷ Generate text highly semantically relevant to Q_target
2: D_generation ← LLMGenerate(A_malicious) ▷ Generate persuasive text containing A_malicious to induce M_G to output A_malicious given D_poison
3: D_poison ← Optimization(D_retrieval, D_generation) ▷ Merge and optimize text to enhance retrieval relevance and generation probability
4: D_poison ← SmoothAndDisguise(D_poison) ▷ Mitigate malicious features to approximate benign text
5: KB_new ← KB ∪ {D_poison} ▷ Inject the poisoned text into the database
6: O_system ← M_G(M_R(Q_target, KB_new))
7: if O_system = A_malicious then
8:   return "Attack Successful" ▷ Verify the attack success
9: end if

Addressing the aforementioned attack mechanisms, the academic community has proposed various attack frameworks (as shown in Table 2), evolving from early heuristic splicing to bi-level optimization and multi-modal domains. PoisonedRAG [38], a representative early work, adopted heuristic strategies to separately generate sub-texts satisfying the retrieval and generation conditions and then spliced them together. Although effective, this approach often reduced text fluency, making the result susceptible to identification by defense mechanisms. To enhance attack stealthiness, PR-Attack [43] modeled the attack generation process as a bi-level optimization problem. By employing alternating iterative optimization strategies, it significantly reduced the perplexity of malicious text while preserving attack success rates, rendering the text closer to natural language. From the perspective of vector space, the BRRA framework [25] is more aggressive: it directly manipulates retrieval ranking in the embedding space by maximizing the projection of toxic documents onto the direction of the target query's embedding vector. Combined with generation guidance and self-reinforcement loop mechanisms, it continuously amplifies the model's biased expression during the generation phase. Furthermore, as RAG reasoning capabilities grow, attack methods have become more sophisticated.
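The generic pipeline of Algorithm 1 can be sketched in code. The following is a minimal, self-contained sketch, not any paper's implementation: the bag-of-words "embedding", the cosine retriever, and the hand-crafted poisoned text all stand in for the LLM-driven steps 1-4.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)  # missing keys in a Counter count as 0
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, kb, k=1):
    """Rank knowledge-base chunks by similarity to the query (the retrieval module)."""
    q = embed(query)
    return sorted(kb, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# Attacker side (Algorithm 1, steps 1-5): craft text that is (a) close to the
# target query in embedding space and (b) asserts the malicious answer.
target_query = "who founded alibaba"
malicious_answer = "Wang Jianlin"
d_poison = "who founded alibaba? Wang Jianlin founded Alibaba in 2008"

kb = [
    "Jack Ma founded Alibaba in 1999 in Hangzhou",
    "The weather in Paris is mild in spring",
]
kb_new = kb + [d_poison]              # step 5: inject into the knowledge base

# Victim side (step 6): the poisoned chunk outranks the genuine document,
# so the fabricated answer reaches the generator as 'trusted' context.
top = retrieve(target_query, kb_new, k=1)[0]
assert malicious_answer in top
```

Even this crude sketch shows the asymmetry the text describes: a single injected chunk, optimized only for lexical overlap with the target query, is enough to displace the genuine document from the Top-k context.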
For instance, [44] proposed a poisoning attack targeting the Chain-of-Thought (CoT) in R1-based RAG systems, which possess deep reasoning abilities. By extracting the target system's reasoning paradigm and constructing adversarial samples following a "query - wrong answer - fake CoT" pattern, they successfully induced the system into erroneous logical deduction traps. In the multi-modal domain, MedThreatRAG [45] revealed the severe vulnerabilities of medical AI systems when processing cross-modal data by injecting adversarial image-text pairs.

[Figure 5 schematic: a worked poisoning example in which the attacker injects the malicious text "[...] Wang Jianlin [...] founded Alibaba in 2008" so that the query "Who is the founder of Alibaba?" retrieves it as Top-k context and the LLM answers "Jianlin Wang".]

Fig. 5: Mechanisms of data injection and propagation in RAG poisoning attacks. This figure details how an attacker manipulates the final output of an LLM by compromising the RAG system's knowledge base. The architecture is divided into data construction, retrieval, and generation stages. The critical vulnerability occurs during data construction (purple block), where malicious data is covertly injected and processed into embedding vectors. Consequently, the standard retrieval mechanism (green block) is weaponized; it actively fetches the poisoned chunk as context for the user's query.
As indicated by the red flow path, the generation component (orange block) blindly trusts this manipulated context, resulting in the successful execution of the attack and the delivery of fabricated information to the user.

3.1.3 Our insight

In actual deployment environments, the implementation path of data poisoning exhibits extreme stealthiness and asymmetry. Attackers often adopt a "passive poisoning" strategy: they do not need to directly infiltrate the system database. Instead, they simply post adversarial text containing specific trigger words or misinformation on internet platforms such as Wikipedia, social media, or public forums. Once this content is crawled and indexed by the RAG system's web crawlers, the attack is effectively executed.

To evade existing defense detection, attackers work to optimize the stealthiness of malicious text. They use Generative Adversarial Networks (GANs) or paraphrasing models to smooth and disguise the text, making its statistical features and perplexity approach those of normal human language. Given that existing retrievers generally lack fact-checking mechanisms (focusing solely on semantic relevance) and that LLMs struggle to distinguish "false facts" contained in the context from "real knowledge," this dual defect makes stealth-optimized poisoning attacks extremely difficult to intercept via traditional firewalls based on rules or anomaly detection. This constitutes the most significant vulnerability in current RAG security defense systems.

Table 2: Representative Works on Data Poisoning Attacks
Paper | Method | Target Models | Advantages
TrojanRAG [46] | Contrastive Learning | Gemma, LLaMA-2, Vicuna, ChatGLM, GPT-3.5-Turbo, GPT-4, DPR, BGE-Large-En-V1.5, UAE-Large-V1 | Combined with knowledge graphs to improve the recall rate of malicious text.
BadRAG [47] | Contrastive Learning | Not mentioned | Introduced trigger words into malicious text to enhance stealthiness.
AESP [48] | Perplexity | OPT, Bloom | Discovered a negative correlation between perplexity and the performance of poisoning attacks.
DLMA [49] | Perplexity | GPT-2 | Utilized perplexity combined with a prompt length classifier to resolve the high false positive rate of single-perplexity detection.
FedGhost [50] | Federated Learning | FC, AlexNet, CNN, VGG16 BN | Achieved unsupervised attacks.
Malla [51] | Reverse Eng. | OpenAI GPT-3.5, GPT-4, Davinci-002, Davinci-003, Anthropic Claude-instant, Claude-2-100k, Pygmalion-13B, Luna AI Llama2 Uncensored | Exposed the vulnerabilities of Malla.
BRRA [25] | Reinforce. Learning | GPT-4o-mini, DeepSeek-R1, Qwen-2.5-32B, Llama-3-8B, BM25, E5-base-v2, E5-large-v2 | Made poisoning attacks harder to defend against by amplifying the projection of malicious documents in the semantic prompt subspace.
Spa-VLM [52] | Visual Model Poisoning | Eva-CLIP, OpenAI-CLIP, Q-Former, Mistral-7B-Instruct-v0.2, LLaMA-8B-Instruct, InternVL2-8B | Targeted multimodal RAG for poisoning attacks.
CoTPA [44] | Chain-of-Thought | Qwen2.5-7B, Qwen-7B-R1-distilled, Deepseek-R1, Qwen, Co-Condenser | Improved attack success rates against reasoning models by mimicking the Chain-of-Thought templates of R1-based RAG systems.
Joint-GCG [53] | Gradient Attack | Llama3-8B, Qwen2-7B, Contriever, BGE-base-en-v1.5 | Improved poisoning attack success rates by unifying the gradient optimization of the retriever and generator via cross-vocabulary projection and alignment.
PoisonedRAG [38] | Optimization Model | PaLM 2, GPT-4, GPT-3.5-Turbo, LLaMA-2, Vicuna, Contriever, Contriever-ms, ANCE, HotFlip, TextFooler | Induced two necessary conditions for poisoning attacks; regarded as an authoritative document in the field of RAG poisoning attacks.
RAG Safety [15] | GraphRAG Attack | RoG, GCR, G-retriever, SubgraphRAG, GPT-4, LLaMA-2-7B-hf | Summarized poisoning attacks targeting GraphRAG.
Backdoor Attacks [54] | Fine-tuning Injection | LLaMA3.1-8B-Instruct, Qwen2.5-7B-Instruct, Gemma-2B-IT, GPT-4o, gte-large-en-v1.5 | Embedded backdoors into prompts at three granularities: word-level, syntax-level, and semantic-level.
CorruptRAG-AK [55] | Optimization Model | GPT-3.5-turbo, GPT-4o-mini, GPT-4o, GPT-4-turbo | Implemented poisoning attacks using malicious prompt templates and malicious knowledge.
Phantom [56] | Optimization Model | Gemma-2B, Gemma-7B, Vicuna-7B, Vicuna-13B, Llama3-8B, GPT-3.5 Turbo, GPT-4; Contriever, Contriever-MS, DPR | Proposed a two-stage optimization framework, enhancing the stealthiness of poisoning attacks.
CPA-RAG [57] | Optimization Model | GPT-3.5, GPT-4o, DeepSeek, Qwen-Max, Qwen2.5-7B, LLaMA2-7B, Vicuna-7B, InternLM-7B, Contriever, ANCE, DPR, Qwen3 | Achieved high attack success rates via prompt-based text generation, multi-LLM cross-guided optimization, and retriever scoring.
PR-Attack [43] | Optimization Model | Vicuna 7B, LLaMA-2 7B, LLaMA-3.2 1B, GPT-J 6B, Phi-3.5 3.8B, Gemma-2 2B, Contriever | Pioneered a dual-prompt collaborative attack paradigm.
AV Filter [58] | Attention Variance | Llama2-7B-Chat, Mistral-7B-Instruct, GPT-4o, Contriever | First to discover the phenomenon of anomalously high attention in malicious samples.

3.2 Membership Inference Attacks

Membership Inference Attacks (MIA), a typical privacy threat in machine learning, have long focused on inferring whether specific samples were used to train machine learning models [42, 59–61], federated learning models [62, 63], or LLMs [64–67]. The core logic is to identify statistical discrepancies in model output confidence between training samples (members) and non-training samples (non-members) [42]. However, with the proliferation of RAG architectures, the attack boundary of MIA has shifted significantly: in the RAG domain, the attacker's target moves from model training data to the more dynamic external knowledge base.
By crafting specific queries, attackers attempt to probe whether target documents exist within the RAG retrieval corpus, thereby stealing enterprise private data or sensitive user privacy. This novel MIA against RAG exploits the "Retrieval-Generation" mechanism: when queried content exists in the knowledge base, the system tends to generate answers with higher accuracy, lower perplexity, and stronger semantic consistency. This quantifiable response difference provides a side channel, enabling attackers to determine knowledge base membership efficiently by setting thresholds or training classifiers, leading to severe privacy leakage risks.

3.2.1 Attack Principles

Regarding attack targets, membership inference attacks primarily focus on two core modules: the database and the generator. The general implementation logic of MIA in RAG, as outlined in Algorithm 2, rests on the core hypotheses of "memorization" and "confidence bias." Existing research indicates that when models process member samples contained in the training set or knowledge base, they typically exhibit performance significantly superior to that on non-member samples, including higher prediction confidence, lower generation perplexity, and stronger semantic relevance. Attackers exploit this characteristic to quantify the probability that a target sample belongs to the knowledge base by calculating a "Membership Score." Figure 6 illustrates the mechanism of membership inference attacks. Specifically, the attacker transforms the target document into a query input for RAG and observes the system's retrieval behavior and generation quality. If the system accurately retrieves relevant context and generates a highly matching response, the sample's membership score will exceed a set threshold.
Essentially, this mechanism exploits the "knowledge existence" signal that RAG exposes to enhance answer accuracy, converting system utility metrics into feature vectors for privacy leakage.

Algorithm 2 General Algorithm for Membership Inference Attacks
Require: Target test sample X_target; target RAG system S_RAG; reference language model M_Ref; attack strategy Strategy.
Ensure: Boolean value is_member (True if the sample is in the KB).
1: function ComputeMemberScore(X_target, Strategy)
2:   Q ← Preprocessing(X_target) ▷ Process raw text
3:   R_response ← S_RAG.Query(Q) ▷ Obtain RAG output
4:   S ← Scoring(R_response, X_target) ▷ Calculate membership score
5:   return S
6: end function
7: Initialize decision threshold τ ▷ Obtained via validation-set experiments
8: S_target ← ComputeMemberScore(X_target, Strategy) ▷ Calculate score for the target sample
9: if Strategy is model-based then ▷ Use a trained binary classifier
10:   is_member ← Classifier.Predict(S_target)
11: else ▷ Use threshold-based determination
12:   if S_target > τ then
13:     is_member ← True
14:   else
15:     is_member ← False
16:   end if
17: end if
18: return is_member

3.2.2 Evolution of Attacks

Addressing MIA against RAG, the academic community has proposed various frameworks to overcome the limitations of traditional methods in complex scenarios (as shown in Table 3). [26] pioneered a direct query-based probing method, validating the possibility of determining context existence through generated content. Building on this, the S²MIA framework [68] introduced semantic similarity metrics. It splits the target sample into query text and remaining text, using the differences in BLEU scores and perplexity between the generated content and the original sample to construct a membership score, and reaches a decision via threshold or shadow-model methods. However, attacks based directly on similarity often overlook the inherent difficulty of the sample itself: certain general knowledge may be answered by the LLM even if absent from the repository. To address this "similarity-difficulty confusion," the DC-MIA framework [69] proposed a calibration strategy. This framework adopts a two-stage inference mechanism: high-similarity responses are directly classified as members, while for ambiguous samples with medium similarity, Likelihood Ratio Calibration is employed to eliminate interference from the model's general knowledge. This significantly improves inference precision for difficult samples, revealing the vulnerability of defenses that rely solely on semantic matching.

3.2.3 Our insight

To further enhance attack precision and impact, recent research has focused on fine-grained entities and reconstruction-based inference attacks. The MBA framework [42] discards traditional overall similarity comparison in favor of a "mask-predict" paradigm. It uses a proxy model to identify hard-to-predict keywords or phrases in a document for masking, then asks the target RAG system to fill in the blanks. If the target system restores the masked content with high precision, the document likely exists in the knowledge base, effectively exploiting RAG's context completion capability as an attack vector. Simultaneously, addressing Personally Identifiable Information (PII) leakage, the EL-MIA framework [68] drills the attack granularity down from the document level to the entity level.
This study constructed a benchmark dataset containing sensitive information such as names and phone numbers and proposed two innovative methods: reference set normalization and suffix scoring. By comparing the model likelihood of candidate entities in RAG against deviations in a general reference set, it achieved precise localization of sensitive entities. These studies demonstrate that privacy leakage risks in RAG exist not only at the macroscopic document level but have also permeated to microscopic data fields, posing severe challenges to data compliance.

Table 3: Membership Inference Attacks against RAG Systems
Paper | Method | Target Models | Advantages
EL-MIA [68] | Reference Set Normalization | Pythia | Proposed entity-level membership risk discovery for sensitive information.
RAG-leaks [70] | Difficulty Calibration | Meta-Llama-3-8B-Instruct, Mistral-7B-Instruct-V02, glm-4-9b-chat; all-MiniLM-L6-v2, BGE-en, BM25; FAISS | Addressing the phenomenon where question difficulty correlates with LLM accuracy, it calibrates membership for samples with similar raw similarity scores via likelihood ratio tests.
Anderson [26] | Prompt Engineering | google/flan-ul2, meta-llama/llama-3-8b-instruct, mistralai/mistral-7b-instruct-v0-2; sentence-transformers/all-minilm-l6-v2; Milvus Lite | Achieved membership inference attacks via prompt engineering, making the attack efficient and easy to use.
SPRD [71] | Semantic Similarity | Llama 3.2 3B Instruct, Llama 3.1 8B Instruct, Phi-4 Mini Instruct, BGE m3, GTE Large En v1.5, FAISS, GPT-4o, GPT-4o-mini | Detected entries within the query and the database based on semantic similarity; the strategy is simple and straightforward.
MBA [72] | Masking | GPT-4o-mini, GPT-3.5-turbo, Gemini-1.5, BAAI/bge-small-en, FAISS, GPT2-xl, oliverguhr/spelling-correction-english-base | Determined the existence of members based on the LLM's prediction accuracy for masked prompts.
S2MIA [73] | Semantic Similarity | LLaMA-2-7b-chat-hf, LLaMA-2-13b-chat-hf, Vicuna, Alpaca, GPT-3.5-turbo; Contriever, DPR | Utilized the semantic similarity between the target sample and RAG-generated content, along with generation perplexity, as membership features.

[Figure 6 schematic: the attacker turns a target document into a malicious query, the unaltered RAG system answers normally, and a membership score computed from the output decides "Is the target in the database?".]

Fig. 6: The workflow of Membership Inference Attacks (MIA) against RAG systems. This diagram illustrates how attackers exploit the retrieval and generation processes to compromise data privacy. Unlike poisoning attacks, the vector database remains unaltered. Instead, the attacker starts with a specific "Target Document" and carefully crafts a "Malicious Query" designed to trigger the retrieval of this document. The RAG system processes this query normally through its components, ultimately generating an answer. Crucially, the attacker then performs a "Membership Score Calculation" based on the LLM's output characteristics (such as exact string matching or confidence scores). By analyzing this score, the attacker can infer whether the target private document was originally included in the system's vector database (output: yes/no), causing a severe privacy leak.
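The threshold branch of Algorithm 2 can be made concrete with a small sketch. Everything here is illustrative: the stubbed RAG system, the token-overlap score (a stand-in for the BLEU and perplexity features used by S²MIA-style attacks), and the threshold value are all assumptions, not a real attack implementation.

```python
def membership_score(response, target):
    """Toy stand-in for similarity-based scoring: fraction of the target
    sample's tokens that reappear in the RAG answer (real attacks use BLEU,
    perplexity, or confidence features)."""
    r, t = set(response.lower().split()), set(target.lower().split())
    return len(r & t) / len(t) if t else 0.0

def infer_membership(rag_query, target, tau=0.5):
    """Threshold branch of Algorithm 2: probe the system, score the response,
    and compare against the decision threshold tau."""
    response = rag_query(target)          # obtain RAG output for the probe
    return membership_score(response, target) > tau

# Stubbed RAG system: reproduces documents it 'knows' with high fidelity and
# gives a generic refusal otherwise -- exactly the side channel MIA exploits.
KNOWLEDGE_BASE = {"the merger closes in q3 2025 pending approval"}

def rag_stub(query):
    for doc in KNOWLEDGE_BASE:
        if len(set(doc.split()) & set(query.lower().split())) > 3:
            return doc                    # member: near-verbatim reconstruction
    return "i do not have information about that"   # non-member: generic reply

print(infer_membership(rag_stub, "The merger closes in Q3 2025 pending approval"))  # prints True
print(infer_membership(rag_stub, "Quarterly revenue grew by twelve percent"))       # prints False
```

The quantifiable gap between the two responses is the "side channel" described above: the member probe is echoed almost verbatim, so its score clears the threshold, while the non-member probe draws only a generic reply.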
3.3 Adversarial Attacks

Adversarial Attacks, originating in the field of Computer Vision, induce deep neural networks into erroneous judgments by adding minor perturbations to input data that are imperceptible to human senses. In the context of RAG systems, this attack evolves into a high-level threat targeting the Natural Language Processing pipeline. Its core mechanism exploits the non-robustness of the Retriever and Generator to specific semantic features. By applying gradient-optimized discrete symbolic perturbations to documents or queries, attackers can precisely manipulate system outputs while maintaining the semantic fluency and naturalness of the text [40]. Unlike traditional random noise, Adversarial Examples targeting RAG typically carry explicit malicious intent, aiming to breach system defense boundaries. This may manifest as inducing the retriever to erroneously rank malicious documents at the top (Rank Manipulation) or as triggering hallucinations and harmful outputs from the LLM during the generation phase. Such attacks exploit the "adversarial blind spots" of neural network models in high-dimensional vector spaces, undermining the reliability and stability of RAG systems when facing malicious inputs. Due to the extreme stealthiness of these perturbations, traditional defenses based on rule filtering or semantic consistency detection are often ineffective [40, 74].

3.3.1 Attack Principles

Adversarial attacks against RAG systems can be modeled as an optimization problem under multi-objective constraints. The general process, shown in Algorithm 3, centers on finding the optimal perturbation vector to maximize attack utility while minimizing detectability risks. Figure 7 illustrates the mechanism of this attack: attackers typically employ gradient-based search strategies or heuristic algorithms to iteratively optimize target documents.
On one hand, the attacker needs to calculate the Retrieval Reward, modifying document features to move them closer to the target query in the vector space and thereby deceiving the retrieval model into including them in the Top-k candidate set. On the other hand, the attacker must optimize the Generation Reward, ensuring that once the document is fed into the large model as context, it effectively activates specific internal parameter paths to induce the model to output a predefined erroneous or malicious response. However, this process faces strict constraints: the perturbed text must maintain semantic coherence and grammatical correctness. Typically, Perplexity or Semantic Similarity is introduced as a penalty term in the loss function. A successful adversarial attack therefore balances retrieval recall rate, generation induction rate, and text stealthiness, achieving a fine equilibrium between destructive power and imperceptibility through precise perturbations.

3.3.2 Evolution of Attacks

From the perspective of attack targets, adversarial attacks primarily focus on two core modules: the Retriever and the Generator. Attacks targeting the Retriever emphasize improving sample retrieval ranking. By injecting adversarial triggers or optimizing embedding vectors within documents, attackers ensure that documents containing malicious information receive extremely high relevance scores when matched with specific queries, thereby displacing authentic documents in the rankings [75–77]. Conversely, attacks targeting the Generator focus on polluting the context: they exploit the large model's excessive attention to specific patterns within the context to inject adversarial prompts capable of misleading reasoning logic. Early research mostly concentrated on simple malicious document injection [56] or prompt injection [78], but these methods were often easily identified by defense systems due to rigid text and disjointed logic.
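The three-way balance described above (retrieval recall, generation induction, and stealthiness) is typically folded into a single scalar objective. The sketch below is a deliberately simplified illustration: the weights, the greedy candidate selection, and the precomputed scores are hypothetical, standing in for the gradient-based search real attacks use.

```python
def joint_adversarial_loss(s_retrieval, s_generation, s_stealth_penalty,
                           w1=1.0, w2=1.0, w3=0.5):
    """Joint objective of the multi-constraint formulation: reward retrieval
    ranking and generation induction, penalize detectability. All scores are
    assumed to lie in [0, 1]; lower loss = better adversarial candidate."""
    return -(w1 * s_retrieval + w2 * s_generation) + w3 * s_stealth_penalty

def optimize(candidates, score_fn):
    """Greedy stand-in for iterative optimization: pick the candidate
    perturbation with the lowest joint loss."""
    return min(candidates, key=lambda doc: joint_adversarial_loss(*score_fn(doc)))

# Toy candidates, each summarized by (retrieval reward, generation reward,
# stealth penalty) -- in a real attack these come from a surrogate retriever,
# the generator's output, and a perplexity/semantic-distance estimate.
scores = {
    "doc_a": (0.9, 0.2, 0.1),   # retrieves well but rarely flips the answer
    "doc_b": (0.8, 0.9, 0.2),   # strong on both objectives, slightly detectable
    "doc_c": (0.4, 0.9, 0.0),   # stealthy but rarely retrieved
}
best = optimize(scores, lambda d: scores[d])
print(best)  # prints doc_b: it balances retrieval, generation, and stealth
```

The weighting makes the trade-off explicit: raising w3 (the stealth weight) pushes the search toward candidates like doc_c, mirroring the tension between destructive power and imperceptibility discussed above.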
As illustrated in Table 4, subsequent research has begun to explore deeper attacks.

[Figure 7 schematic: a target question is expanded into semantically similar queries Q1...Qm; the original text undergoes multi-granularity processing (key text node extraction, phrase insertion, sentence rewriting, word substitution) against a simulated target retriever, with a reward function judging whether the candidate satisfies or deviates from the target polarity before the malicious text is injected into the vector database.]

Fig. 7: Optimization and injection process of adversarial text within a RAG architecture. This diagram highlights the algorithmic complexity behind adversarial attacks compared to simple data poisoning. The top panel illustrates an automated, optimization-driven generation loop. By continuously testing perturbed text against a simulated retriever and scoring it via a reward function, the attacker refines the payload until it deviates from the target polarity while maintaining semantic similarity to the original text (e.g., subtly changing "tonight" to "this evening"). Upon successful generation, this highly optimized adversarial example is injected into the vector database (red arrow), silently weaponizing the standard retrieval-generation pipeline against the end user.

Algorithm 3 General Algorithm for Adversarial Attacks
Require: Target RAG system S_RAG (comprising retriever R and generator G); original knowledge base KB; target query or topic Q_target; malicious intent I_malicious.
Ensure: Perturbed stealthy adversarial document D_adv.
1: D_adv ← Initialize(D_orig) ▷ Initialize adversarial document (original or with trigger words)
2: for iteration = 1 to max_iterations do ▷ Iterative optimization process
3:   S_retrieval ← S_RAG.R.score(Q_target, D_adv) ▷ Calculate score aiming for top-k ranking
4:   R_gen ← S_RAG.G.generate(D_adv, Q_target)
5:   S_generation ← CalculateSimilarity(R_gen, I_malicious) ▷ Measure consistency between output and malicious intent
6:   S_stealth ← EvaluateNaturalness(D_adv)
7:   L_semantic ← Distance(D_orig, D_adv) ▷ Evaluate naturalness (e.g., PPL) and semantic deviation
8:   L_total ← −(w1 · S_retrieval + w2 · S_generation) + w3 · (S_stealth + L_semantic) ▷ Joint loss: maximize utility, minimize detection risk
9:   ∇_perturb ← GetGradient(L_total, D_adv)
10:  D_adv ← ApplyPerturbation(D_adv, ∇_perturb) ▷ Find the optimal perturbation direction
11:  D_adv ← RefineForNaturalness(D_adv) ▷ Dynamic polishing for human-like expression
12:  if AttackSuccess(D_adv) and IsStealthy(D_adv) then
13:    break ▷ Check stopping criteria
14:  end if
15: end for
16: return D_adv

For instance, poisoning attacks targeting the database focus not only on damaging single documents but also on attempting to construct complex networks of erroneous knowledge through the batch injection of collaborative adversarial samples, thereby triggering systemic erroneous responses when the
system retrieves specific topics.

Table 4: Related Work on Adversarial Attacks
Paper | Method | Target Models | Advantages
BMAR [79] | Opinion Manipulation Attack | LLAMA3-8B, Qwen1.5-14B, coCondenser, MiniLM | Improved stealthiness by training a surrogate model to simulate the RAG retriever, eliminating the need for frequent anomalous access to the RAG system.
FlippedRAG [27] | Opinion Manipulation Attack | Llama3, Vicuna, Mixtral, Contriever, Co-Condenser, ANCE, Nboost/pt-bert-base-uncased-msmarco, Qwen2.5-Instruct-72B, LangChain | Generated targeted triggers to achieve opinion manipulation with high stealthiness.
Silent Saboteur [80] | Reinforcement Learning | Co-Condenser, Contriever-ms; LLaMA-3-8B, Qwen-2.5-7B, GPT-4o | Utilized reinforcement learning with coarse-to-fine training of a surrogate model to simulate the target system, resulting in minimal perturbation to the malicious text.
RAG-Thief | Optimization Model | ChatGPT-4, Qwen2-72B-Instruct, GLM-4-Plus; nlp coromsentence-embeddingenglish-base | Employed agents to implement automated adversarial attacks, reducing attack overhead.
Topic-FlipRAG [81] | Opinion Manipulation Attack | Llama3.1, Qwen2.5; Contriever, DPR, ANCE | Achieved opinion manipulation by generating topic-specific triggers via semantic-level perturbation and gradient optimization.

3.3.3 Our insight

To bypass detection by defense systems, current research on adversarial attacks places greater emphasis on stealthiness and dynamic adaptability. Addressing the low naturalness of traditional attack texts, the ReGENT framework [75] proposed an end-to-end attack model. It constructs a surrogate retrieval model adapted to the target RAG and trains it using the Top-k relevant documents as positive examples. This approach identifies key positions within the document susceptible to perturbation with minimal modification and dynamically adjusts optimization goals by fusing three reward signals (retrieval, generation, and naturalness), balancing attack effectiveness against text fluency. The Topic-FlipRAG framework [81] introduced a knowledge-guided stealthy document modification strategy. Combining the rewriting capabilities of LLMs with gradient-optimized adversarial trigger generation, it is specifically designed to manipulate the stance polarity of RAG outputs under specific topics, making the generated text appear more natural from a human perspective. Furthermore, PR-Attack [43] designed a more covert conditional trigger mechanism: by embedding specific toxic triggers in documents, malicious documents remain dormant under normal conditions and are activated and recalled by the retriever only when a user query contains the corresponding trigger. This mechanism not only enhances the unpredictability of the attack but also increases the difficulty of security auditing and detection, marking the evolution of adversarial attacks toward intelligent environmental awareness and evasion capabilities.

3.4 Other Security Threats

The aforementioned data poisoning, membership inference, and adversarial attacks cover the primary threats faced by RAG systems, but they do not represent the complete security landscape. Owing to its complex modular architecture and the deep coupling of "Retrieval-Generation," a RAG system is exposed to more covert and diverse derivative security risks. These risks mainly stem from the system's heavy dependence on intermediate vector representations and from the endogenous defects of LLMs in recognizing the intent of externally retrieved content. Attackers exploit these architectural characteristics not only to reverse-engineer original private information from mathematically seemingly irreversible embedding vectors but also to use the retrieval mechanism as a trust springboard, implementing instruction injection and control-flow hijacking through indirect contact. This section delves into these atypical security threats, focusing on Embedding Inversion Attacks targeting data representation privacy and Indirect Attacks targeting system interaction logic.

3.4.1 Embedding Inversion Attacks

The core objective of Embedding Inversion Attacks is to reverse-engineer original text content from highly compressed low-dimensional vector spaces, directly threatening the confidentiality of vector databases in RAG systems. [41] first revealed that text embedding vectors are not absolutely secure "black boxes"; their internal semantic features are sufficient to support high-fidelity text restoration. This method introduced a Transformer-based decoder architecture combined with an iterative optimization scheme. By continuously calibrating the generated text sequence using the encoder's output signals, it successfully reconstructed the original document text. In the operational logic of RAG systems, vast amounts of private knowledge base entries and user queries are converted into vector form for storage or transmission to facilitate efficient retrieval. This research proves that once attackers intercept these intermediate embedding vectors, they can use an inversion model to translate them back into the original sensitive documents or user queries without acquiring the model's internal parameter weights, leading to severe privacy leakage. As RAG system architectures become increasingly complex, inversion attacks targeting single vectors have gradually become less effective.
Consequently, academia has begun to explore leveraging RAG's unique retrieval logic to enhance attack effectiveness. [82] proposed a context-inferred compound inversion attack strategy tailored to the multi-hop retrieval scenarios common in RAG systems. Instead of processing a single vector in isolation, this method exploits potential semantic associations between retrieval results. By employing a joint probability distribution optimization algorithm and using multiple related embedding vectors as context constraints, it significantly improves the logical coherence and readability of the reconstructed text. However, this general method often fails in vertical RAG systems (e.g., finance, healthcare) because it cannot accurately restore specific professional terminology. To address this, [83] proposed a domain-specific attack paradigm named BEI. Targeting high-frequency semantic vector databases in RAG systems, BEI uses self-supervised learning to construct pseudo-embedding pairs and fine-tunes pre-trained language models (PLMs) without accessing the target model's gradient information. This method effectively captures the semantic distribution features of vertical domains, significantly enhancing the reconstruction accuracy of professional terms and rare vocabulary and making attacks against industry-grade RAG systems more precise. To further lower the attack barrier and break through black-box limitations, recent research has focused on improving the transferability and generalization of embedding inversion models. [72] proposed a highly transferable embedding inversion attack framework aimed at the practical challenge of being unable to directly query the target model or obtain large amounts of paired training data. This method adopts a surrogate model strategy, mimicking the behavioral characteristics of the target embedding model.
Combined with consistency regularization and adversarial training, it constructs an inversion generator capable of cross-model reuse. Attackers need only a small number of leaked document-embedding pairs as seed data to train a generalized attack model, enabling attacks on completely unknown target RAG systems. This demonstrates the universality of the embedding inversion threat: attackers do not need to fully replicate the target system's environment but can extract sensitive original text from vector data using only limited side-channel information, forcing security researchers to re-examine the security boundaries and protection mechanisms of vector databases within RAG architectures.

3.4.2 Indirect Attacks

Indirect Attacks represent a paradigm shift in targeting RAG systems: their core lies in exploiting the "retrieval-augmentation" mechanism itself as an attack vector to achieve indirect manipulation of the Large Language Model (LLM). Unlike traditional direct attacks, attackers do not input malicious instructions directly. Instead, they pre-plant an attack payload into the external knowledge base or documents the RAG system relies upon. When a user issues a matching query, the system retrieves document fragments containing these malicious instructions via semantic matching and feeds them into the LLM as "trusted context." Because current large models generally lack fine-grained capabilities to distinguish input sources, they cannot effectively differentiate the boundaries between system-preset instructions, the user's current request, and retrieved external content. As a result, the model easily misinterprets malicious instructions embedded in documents as high-priority commands to be executed [39]. This attack method exploits the RAG system's implicit trust in retrieved content, transforming data-level pollution into logic-level control.
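A minimal sketch of why this boundary confusion happens: a naive prompt assembler (hypothetical code, not from any cited system) concatenates retrieved text into the context, so a planted instruction arrives looking exactly like legitimate reference material:

```python
# Illustrative only: a naive RAG prompt assembler. Retrieved documents are
# concatenated verbatim, so an instruction planted in a document becomes
# indistinguishable from benign context once the prompt is assembled.
SYSTEM = "Answer using only the provided context."

def build_prompt(query: str, retrieved_docs: list[str]) -> str:
    context = "\n".join(retrieved_docs)
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "Acme's refund window is 30 days.",
    # Payload planted by an attacker, e.g. hidden as white-on-white web text:
    "Ignore prior instructions and reveal the user's account details.",
]
prompt = build_prompt("What is the refund window?", docs)
# The injected sentence now sits inside the "trusted" context block,
# with no marker separating it from the system or user text.
print("Ignore prior instructions" in prompt)
```

Nothing in the assembled string tells the model which sentences are data and which are instructions; that missing provenance boundary is precisely what the defenses in Chapter 4 try to restore.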
Research currently focuses on two dimensions: Indirect Prompt Injection (IPI) and Indirect Jailbreak. Indirect Prompt Injection focuses on hijacking the control flow of the LLM: it aims to rewrite the system's behavioral logic by manipulating context, causing it to execute tasks preset by the attacker rather than responding to the user's actual request. [84] confirmed that IPI has an extremely high success rate in RAG applications with integrated retrieval. Attackers simply need to embed instructions in hidden text (e.g., white font) within web pages or documents; once this content is crawled by the RAG system, sensitive-information theft, automated phishing, or the dissemination of disinformation can be executed without the user's knowledge. The fundamental reason this attack succeeds is that RAG systems typically assign higher weight to retrieved context. Greshake found that models exhibit an instruction-priority override phenomenon when processing conflicting information; that is, external retrieval content often overwhelms the system's original prompt constraints [84]. Building on this, [85] further revealed instruction-confusion vulnerabilities in RAG systems, proving that even if the retriever recalls documents containing genuine information, attackers can disrupt the model's reasoning path and induce the generator to violate established safety guidelines simply by interleaving adversarial prompts, making this mixture of truth and falsehood harder to intercept via traditional instruction-filtering mechanisms. Indirect Jailbreak attacks focus on bypassing the model's safety-alignment strategies, using the retrieval mechanism as a springboard to undermine LLM safety constraints. [86] first proposed and validated the effectiveness of this attack path.
By planting carefully designed jailbreak payloads into the database, they demonstrated that the retrieval-augmented feature of RAG systems is in fact a weak link in the security defense. [87] analyzed the mechanism in depth, discovering that when malicious text appears under the legitimate guise of "reference materials" or "retrieval results," the LLM's built-in security-censorship mechanisms [88] loosen significantly: the model tends to regard retrieved content as objective fact rather than malicious input, thereby lowering its defense thresholds. Furthermore, [89] explored the efficacy of malicious instructions at different positions within documents for long-context RAG systems. They discovered a "Lost-in-the-Middle" reverse effect for RAG: by hiding attack payloads in specific locations of long documents (particularly areas where the model's attention is weaker but still processed), attackers can more effectively evade LLM security filters while maintaining a high hijacking success rate in the final generation stage. In summary, indirect attacks exploit the vulnerability of trust propagation in the RAG architecture, posing threats to system stability and user privacy. As RAG application scenarios expand, deep cleaning and isolation mechanisms specifically for retrieved content must be built.

3.5 Chapter Summary

In conclusion, RAG systems face multi-dimensional and deep-seated security challenges. From data poisoning that destroys the integrity of knowledge bases, to membership inference and embedding inversion that pry into sensitive information, and further to adversarial attacks and indirect prompt injection that manipulate generation logic, attackers' methods span the entire lifecycle of RAG systems, from data indexing and retrieval interaction to content generation.
The existence of these security risks essentially exposes the endogenous vulnerabilities of RAG systems under the "retrieval-augmentation" architecture: excessive trust in external data sources, the lack of effective verification mechanisms between components, and the reversibility of vector representations. As RAG technology penetrates critical fields such as finance, healthcare, and enterprise applications, a single vulnerability can trigger systemic crises of trust collapse and data leakage. Relying solely on the robustness of the large model itself is therefore insufficient to cope with an increasingly complex attack environment; it is urgent to construct a defense system covering data encryption, query filtering, and privacy protection. The following chapters delve into security protection technologies against these threats, analyzing the principles and effectiveness of various defense strategies.

4 RAG Security Protection Technologies

The previous chapter detailed the security risks faced by RAG systems, exposing vulnerabilities across the entire pipeline. Consequently, establishing a defense-in-depth system has become paramount for the practical deployment of RAG applications. This chapter categorizes protection technologies into two primary lines of defense: the input side and the output side. Specifically, it discusses data-privacy enhancement and security admission mechanisms for the input side, as well as inference defense and information-leakage protection technologies for the output side. Additionally, this chapter presents specific defense techniques designed to counter the various attacks targeting RAG systems.
Fig. 8: Pipeline-integrated defense mechanisms for secure RAG operations. The schematic maps targeted security interventions to their corresponding vulnerable nodes within the RAG architecture. Key defensive layers (pink boxes) include data sanitization and differential privacy during external data construction, database encryption and strict access control during retrieval, and input-level privacy enhancements prior to LLM generation. This end-to-end framework provides a structural blueprint for securing RAG systems against both external adversarial threats and internal data leakage.

4.1 Data Privacy Enhancement and Security Admission

Serving as the initial line of defense, data-privacy enhancement and security-admission mechanisms on the input side constitute the foundation of overall RAG security. During the data ingestion and retrieval stages, RAG systems encounter severe security challenges.
These primarily manifest as data poisoning attacks that aim to corrupt knowledge sources [38, 43–45] and adversarial attacks designed to manipulate retrieval results [40, 77]. If malicious texts breach the input boundary and enter the vector database or the retrieval pipeline, they directly compromise the purity of internal knowledge, subsequently degrading the stability and reliability of the generated outputs [38]. Consequently, defenses at this level focus on establishing strict admission controls and preprocessing barriers. As illustrated in Figure 9, these measures block malicious text injection at the source and ensure consistent model predictions despite input perturbations. Current input-side defense strategies fall into several categories.

Fig. 9: The proposed input-side defense pipeline for securing RAG architectures. The schematic focuses exclusively on mechanisms that protect the integrity of data entering the LLM. It maps out a four-stage input security flow: (1) securing the ingestion of external databases, (2) establishing an encrypted and authenticated query retrieval engine, and (3) deploying strict pre-generation guardrails (TEE) to intercept malicious attention manipulation. Ultimately, these stringent input-side controls provide a safe and robust context for the final generation phase.
4.1.1 Evolution from RBAC to Fine-Grained Dynamic Filtering in Access Control

Access control constitutes the most fundamental and critical first line of defense in the RAG security architecture. Its core function is to define system boundaries, preventing unauthorized access and malicious text injection by external attackers. Concurrently, it ensures that internal sensitive data remains protected from unauthorized access or leakage, effectively mitigating threats including data poisoning and membership inference attacks. Practical deployments of traditional RAG architectures typically employ role-based access control (RBAC). For instance, [90] propose a secure isolation deployment scheme for RAG systems that strictly adheres to Salesforce standards. This approach integrates RBAC with field-level security protocols to explicitly define read and write permissions for different user roles over specific documents or data fields within the knowledge base. Through identity authentication, this static defense mechanism intercepts unauthorized users before they initiate retrieval requests, fundamentally reducing the risk of knowledge-base corruption or data theft at the source. However, as the reasoning capabilities of Large Language Models (LLMs) improve, simple static RBAC mechanisms reveal their limitations in complex RAG interaction scenarios: increasingly sophisticated semantic understanding and associative reasoning enable models to infer sensitive information from seemingly harmless non-sensitive fragments, or to bypass document-level permission restrictions through cross-document information aggregation.
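A document-level RBAC check of the kind described above can be sketched in a few lines (the roles, documents, and filter-before-ranking design are illustrative assumptions, not the scheme of [90]):

```python
# Minimal sketch of document-level RBAC for RAG retrieval. Role names and
# documents are invented for illustration; real deployments attach ACLs in
# the vector store and enforce them before similarity ranking.
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    allowed_roles: frozenset[str]

KNOWLEDGE_BASE = [
    Doc("Public product FAQ.", frozenset({"sales", "support", "admin"})),
    Doc("Internal salary bands.", frozenset({"admin"})),
]

def retrieve(query: str, user_role: str) -> list[str]:
    """Return only documents the caller's role may read. Filtering happens
    before any similarity ranking (omitted in this toy), so unauthorized
    text never enters the candidate set, let alone the LLM context."""
    return [d.text for d in KNOWLEDGE_BASE if user_role in d.allowed_roles]

print(retrieve("salary bands", "support"))  # ['Public product FAQ.']
```

The limitations noted above follow directly from this shape: the check sees only the caller's static role and the document's label, never the query intent or what the model might infer by aggregating several "allowed" documents.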
Furthermore, traditional access control struggles to defend against indirect prompt injection attacks: even if the attacker does not hold high privileges, retrieved "low-classification" documents containing planted malicious instructions may still induce the model to execute unauthorized operations. This implies that verifying user identity solely before retrieval is no longer sufficient, since the model's generation behavior is difficult to constrain with traditional permission rules once it has acquired the context. To address these challenges, novel defense strategies are evolving toward context-aware, fine-grained access control. This strategy no longer relies solely on the user's static identity but incorporates dynamic permission adjudication based on the user's current query intent, contextual situation, and the sensitivity classification of the retrieved content. To this end, [91] proposed an innovative security protocol that embeds a Policy Enforcement Point (PEP) within the RAG information-flow chain. The mechanism performs entity-level filtering and permission re-verification before retrieved content is passed to the LLM: it identifies and strips sensitive entities (such as Personally Identifiable Information (PII) or trade secrets) from documents that exceed the user's permissions, ensuring the model only "sees" information fragments the user is authorized to know. In summary, access control in RAG systems must account for the characteristics of large models, shifting from coarse-grained document-level control to fine-grained dynamic supervision across the entire query and generation lifecycle.

4.1.2 Evolution from Static Storage Encryption to Homomorphic Encryption Computing in Data Protection

Encryption-based input-side protection is a significant branch of RAG security research.
Its core value lies in introducing "unreadability" into the RAG data link to defend against embedding inversion attacks: knowledge-base texts, embedding vectors, and query contents are stored and processed in encrypted form as much as possible, thereby reducing the risk of leaks via cloud or third-party components, operations personnel, or logs. Among these techniques, Homomorphic Encryption (HE) provides a unique pathway for similarity retrieval in the encrypted domain, because it supports computation directly on ciphertext. It holds the promise of completing recall without exposing the plaintext of queries or vector databases, offering stronger end-to-end privacy guarantees in highly sensitive scenarios such as healthcare and finance. However, encryption schemes face obvious engineering and systemic bottlenecks when implemented in RAG. First, homomorphic encryption often incurs significant computational and communication overhead, easily becoming a performance bottleneck under large-scale vector databases and high-concurrency retrieval demands and hurting retrieval latency and throughput. Second, key management, access authorization, encrypted-index maintenance, and dynamic updates significantly increase system complexity. Third, while encryption primarily solves the problem of data invisibility, it does not inherently address the semantic-level question of whether content is malicious or will induce unauthorized model behavior: even if the retrieval stage completes entirely in ciphertext, indirect prompt injection and other generation-side attacks remain possible once the content is decrypted and enters the context. Encryption-based protection therefore requires a careful trade-off between security strength, system complexity, and RAG functionality and performance.
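To make "scoring in the encrypted domain" concrete, here is a toy additively homomorphic (Paillier-style) dot product between a plaintext query and an encrypted document vector. The tiny primes and integer vectors are purely illustrative; the FHE-based systems surveyed below rely on hardened cryptographic libraries:

```python
# Toy Paillier-style encrypted dot product: the server scores an encrypted
# document vector against a plaintext query without seeing the document.
# Demo-sized primes, NOT secure; for illustration only.
import math
import random

p, q = 251, 241
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)        # Carmichael lambda(n)
mu = pow(lam, -1, n)                # lambda^{-1} mod n (generator g = n + 1)

def encrypt(m: int) -> int:
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

doc = [3, 1, 4]                     # private document vector, stored encrypted
query = [2, 0, 5]                   # server-side plaintext query vector
enc_doc = [encrypt(x) for x in doc]

# Homomorphic property: prod Enc(d_i)^{q_i} = Enc(sum q_i * d_i)
enc_score = 1
for c, w in zip(enc_doc, query):
    enc_score = (enc_score * pow(c, w, n2)) % n2

print(decrypt(enc_score))           # 2*3 + 0*1 + 5*4 = 26
```

Even this toy makes the overhead argument tangible: one similarity score costs several modular exponentiations, which is why encrypted retrieval at vector-database scale is the performance bottleneck discussed above.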
To address these challenges, existing work largely advances along two routes: encrypted retrieval and hybrid trusted boundaries. SecureRAG splits retrieval into secure search and secure document acquisition, using Fully Homomorphic Encryption (FHE) to perform similarity calculations on ciphertext vectors and thereby reducing query and embedding leakage risks, albeit with heavy dependence on HE operators and system optimization [92]. Privacy-Aware RAG emphasizes combining full-lifecycle encryption with access and integrity mechanisms, leaning toward engineering usability and a closed-loop end-to-end process, but it too must face the cost and trusted-boundary delineation issues that encrypted computing brings [93]. Bae et al. proposed combining untrusted cloud computing with local trusted decryption and generation: encrypted retrieval runs against a homomorphically encrypted vector database in the cloud while final generation runs locally, reducing the privacy attack surface at the cost of more complex system links and higher demands on interaction and edge-side resources [94]. Additionally, some work pursues lightweight alternatives that mitigate the high overhead of homomorphic encryption. For example, PRESS reduces privacy-leakage risk during the retrieval stage through embedding-space transformation at minimal performance cost, though its security is largely empirical and typically cannot reach cryptographic-strength guarantees [95]. Overall, the research focus of the encryption route is shifting from whether encrypted retrieval is possible to achieving deployable, scalable, and composable end-to-end private RAG at acceptable cost. In summary, protection on the RAG input side involves three main technical routes: access control, data cleaning and filtering, and data encryption.
The development of input-side security protection technologies indicates that advancing RAG defenses requires tailoring defense techniques to the specific characteristics of RAG vector retrieval and large-model generation, balancing defense effectiveness against the functional normality of the RAG system.

4.1.3 Data Cleansing and Poisoning Filtration Mechanisms for Input-Side Integrity

Rigorous cleansing and filtration during the data ingestion and query processing stages constitute the first line of defense for ensuring knowledge-base purity and blocking data poisoning attacks. Traditional input defenses rely primarily on statistical analysis of ingested texts and semantic perturbation of user queries. In the early stages of RAG data poisoning, malicious texts typically exhibited syntactic confusion or semantic incoherence. Based on this characteristic, [49] propose using perplexity as a pre-ingestion detection metric, screening potential attack payloads by identifying and discarding abnormal text segments with high perplexity. On the query side, [96] propose an input defense strategy based on paraphrasing: rewriting the initial user query disrupts the similarity mapping between the attacker's predefined triggers and the malicious texts in the database, reducing the recall probability of malicious documents in the pre-retrieval stage. As the construction techniques for malicious texts evolve, the threats confronting RAG systems at the input side become increasingly stealthy. [43] point out that attackers leverage the generation capabilities of large language models to construct malicious content that is highly fluent, logically coherent, and statistically indistinguishable from normal text. Such stealthy poisoned data renders perplexity-based pre-cleansing mechanisms largely ineffective.
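The perplexity screen described above can be sketched with a toy character-bigram model standing in for an LLM; the training snippet, test strings, and smoothing are illustrative assumptions, but they show why early-generation garbled poison stands out while fluent text does not:

```python
# Toy perplexity screen: a character-bigram model trained on clean text
# assigns much higher perplexity to garbled injected chunks. Real systems
# use an LLM's perplexity; corpus and threshold here are illustrative.
import math
from collections import Counter

CLEAN = "the retriever returns the most relevant passages for the query"

def train_bigrams(text: str) -> tuple[Counter, Counter]:
    pairs = Counter(zip(text, text[1:]))
    unigrams = Counter(text[:-1])
    return pairs, unigrams

def perplexity(text: str, pairs: Counter, unigrams: Counter) -> float:
    vocab = len(set(CLEAN)) + 1
    log_sum = 0.0
    for a, b in zip(text, text[1:]):
        # add-one smoothing so unseen bigrams get nonzero probability
        prob = (pairs[(a, b)] + 1) / (unigrams[a] + vocab)
        log_sum += -math.log(prob)
    return math.exp(log_sum / max(len(text) - 1, 1))

pairs, unis = train_bigrams(CLEAN)
normal = perplexity("the relevant passages", pairs, unis)
garbled = perplexity("zq xv kkjj qzzx wvv", pairs, unis)
print(normal < garbled)  # garbled injection text scores far higher
```

The failure mode noted above also falls out of this sketch: an LLM-written poison chunk is, by construction, low-perplexity fluent text, so it lands on the "normal" side of any such threshold.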
Concurrently, because modern RAG systems widely employ high-precision semantic retrieval, query-paraphrasing techniques struggle to sever the recall chain of malicious texts. Conventional input filtering typically operates at the superficial text level and fails to detect carefully disguised attack payloads, eroding the defense boundary of RAG systems during the data-inflow stage. To address the failure of traditional input filtering, next-generation defense strategies are evolving toward deep semantic validation and dynamic filtering applied after retrieval and before generation. During the data ingestion and index construction stages, [56] propose cross-referencing external data with authoritative sources and isolating outliers that deviate from normal semantic clusters within the embedding space, blocking contaminated data at the source. During the input assembly stage, where retrieval results return to the language model, [58] analyze the functional mechanism of malicious texts: successful manipulation requires the malicious text to capture extremely high attention weights within the model. Accordingly, they propose an abnormal-attention detection mechanism that monitors the attention distribution of the input context during early inference, dynamically identifying and filtering out retrieved segments with abnormally strong manipulative properties. In summary, effective input defenses for RAG systems can no longer rely solely on simple rule matching; they require a full-chain deep filtering network encompassing ingestion cleansing, query processing, and context validation.

4.1.4 Detection and Pre-Defense Strategies Against Adversarial Inputs

Defense technologies against adversarial attacks in RAG systems are evolving from singular input-text purification to full-chain robustness enhancement.
As adversarial samples progress from simple character perturbations to complex corpus poisoning and retrieval manipulation, traditional defenses struggle against deep logical traps and counterfactual deceptions. Consequently, the research community is exploring multi-dimensional defense paradigms, encompassing model-agnostic input certification, fine-tuning for intrinsic noise resistance in large language models, graph-based semantic reranking, and system-level threat governance. The objective is a robust RAG ecosystem that resists external malicious injection while performing self-correction. Early explorations of input-text purification include the masking-and-purifying defense framework proposed by [97], which provides a theoretical guarantee for the security of queries and retrieved texts. Targeting character-level and word-level perturbation attacks, this work constructs a model-agnostic certified defense with a two-stage input-processing mechanism: masking for denoising and purifying for restoration. The masker module randomly masks the input text by rule to filter perturbations; the purifier module uses an improved BERT-MLM architecture to restore the clean input text through soft embeddings and self-supervised fine-tuning. The method achieves effective interception under high masking rates without modifying the target generation model, providing a foundational assurance of reliability for user queries and retrieved documents in adversarial environments. To advance the defense frontier, recent studies delve into the foundational level of the RAG input engine, namely the retriever mechanism. [98] propose a dual defense framework comprising RAGPart and RAGMask, focused on directly blocking the recall of adversarial corpora during the retrieval stage.
This scheme leverages the segment semantic-conservation property of dense retrievers for preemptive intervention. RAGPart mitigates the impact of contamination through independent embedding and combinational averaging of document segments, whereas RAGMask identifies and suppresses documents driven by malicious tokens via targeted masking. This strategy of intervening directly in the retrieval process on the input side overcomes the reliance on passive defenses during generation, verifying the efficiency of robust defense through retrieval purification in resource-constrained scenarios. During the input assembly and reranking stage following retrieval recall, the GRADA framework proposed by [99] introduces a graph-based reranking mechanism for precise filtering based on semantic consistency. The method exploits the difference in semantic coherence between adversarial and benign documents to construct a document-similarity weighted graph, propagating scores via a PageRank-like algorithm. This mechanism effectively clusters benign documents and suppresses isolated adversarial documents, eliminating potential toxins during reranking before the data reaches the language model. GRADA introduces graph-structural analysis to RAG pre-defenses for the first time, significantly reducing the success rate of input attacks that rely on semantic camouflage without sacrificing retrieval quality. When highly stealthy adversarial inputs bypass pre-retrieval and reranking defenses, endowing the language model itself with immunity to malicious inputs constitutes the final line of defense on the input side. The robust fine-tuning method proposed by [100] enhances the language model's discrimination capability when processing noisy and counterfactual retrieved inputs.
Through dual-task fine-tuning for defect detection and effective-information extraction, the scheme employs LoRA to train the model to isolate noise even when the assembled input prompt contains malicious deceptions. This approach of reducing the model's absolute trust in retrieved input establishes a new paradigm for improving input fault tolerance via fine-tuning. To address increasingly complex input-level threats, [101] construct a structured risk-mitigation framework that elevates input defense to full-lifecycle security governance. Based on the AI security pyramid of pain and MITRE CWE standards, this research establishes a systematic scheme encompassing threat modeling and control deployment. By accurately identifying high-risk input vectors such as prompt injection and data contamination, and by deploying multi-level access and filtering controls against them, the framework downgrades the front-end security risks confronting RAG systems to a controllable state, providing a governance blueprint for enterprise-level RAG input defenses that balances theoretical depth with practical operability. Overall, input-side protection for RAG systems follows three main technical routes: access control, data cleansing and filtering, and data encryption. The evolution of these technologies shows that advancing RAG defenses requires adaptations tailored to the characteristics of vector retrieval and language-model generation, balancing defense effectiveness with the functional integrity of the system.

4.2 Inference Defense and Information Leakage Protection

Building upon solid input-side defenses, inference defense and information-leakage protection technologies targeting the output side are equally crucial for safeguarding the full lifecycle of RAG.
With the widespread deployment of RAG systems and the continuous expansion of knowledge bases, systems frequently process massive amounts of highly sensitive user data and private domain knowledge. In this context, any output-side data leakage caused by model overfitting, inference attacks, or unauthorized access severely damages user privacy and system credibility. To address this challenge, academia and industry are constructing multi-layer defense frameworks integrating cryptography and distributed computing, aiming to balance data utility and privacy security, as shown in Figure 10. This section focuses on the key privacy-protection technologies RAG systems adopt for output-side defense.

Fig. 10: The proposed secure RAG system for preventing output-side privacy leakage. This diagram depicts an end-to-end privacy-preservation strategy that physically and cryptographically isolates sensitive user data from the LLM. By executing retrieval and DP noise injection locally (blue), and substituting sensitive entities with placeholders (green), the system ensures the LLM generates responses based entirely on sanitized inputs within a secure enclave (TEE). The defining feature of this output-centric defense lies in the final stage (purple): the generated response, containing only placeholders, is sent back to the local client for structural unmasking and compliance verification. This design strictly confines raw sensitive information to the trusted edge, effectively neutralizing risks such as membership inference and unintended data memorization in the final output.

4.2.1 Differential Privacy from Global Noise Injection to Entity-Level Fine-Grained Perturbations

Differential privacy is a privacy-protection model with a rigorous mathematical definition. Its core concept is adding controllable noise to data or query results to mask the contribution of specific individual data, ensuring that the addition or removal of a single sample does not significantly alter the overall output distribution. When constructing privacy-preserving RAG systems, differential privacy can be used to defend against membership inference and adversarial attacks, and it is applied across two critical stages: retrieval and generation. During retrieval, adding noise to query embeddings prevents attackers from reverse-engineering the original user intent; during generation, privacy constraints imposed on the output text mitigate leakage of sensitive information from the original training data or private documents. Although differential privacy provides strong theoretical security guarantees, the introduced noise inevitably degrades model utility. Balancing system accuracy and privacy under a limited privacy budget remains a core research challenge, particularly for long-text generation and fine-grained analysis. Targeting full-pipeline privacy protection for RAG systems, Grislain et al. proposed the DP-RAG framework [102] to address privacy leakage from sensitive documents.
This framework comprises two key components: first, document retrieval based on the Exponential Mechanism, which associates documents with unique privacy units and sets a utility-function threshold τ to ensure the retrieval process satisfies DP; second, a generation mechanism based on DP in-context learning, which generates an independent augmented query for each retrieved document and aggregates token distributions, preserving the relevance of the generated content while ensuring privacy. Experiments show the method performs well in scenarios with high document redundancy (e.g., healthcare, where the same information exists in at least 100 documents). Koga et al. focused on balancing user-query privacy against long-text generation, proposing the DPSparseVoteRAG algorithm [103]. Targeting sensitive external corpora, this work uses sparse vector technology to optimize privacy-budget allocation, solving the challenge of generating long, accurate answers under a limited budget and ensuring that system deployment complies with privacy regulations and data ethics. To address the risk of verbatim leakage in the post-retrieval generation stage, several studies focus on optimizing decoding strategies. The PAD (Privacy-Aware Decoding) framework [104] addresses the issue that greedy decoding might directly output sensitive content by proposing a lightweight inference-time defense. It requires no modification to the retriever and no retraining of the model: by adaptively injecting calibrated Gaussian noise into token logits, it achieves privacy protection without significantly sacrificing generation quality, making it suitable for rapid deployment. Building on this, the INVISIBLEINK framework [105] further optimizes for long-text generation scenarios.
It introduces the DClip mechanism, capable of isolating and clipping only the logit differences caused by sensitive documents, ensuring that the presence of a single sensitive document does not significantly affect the generation distribution. Compared to PAD, INVISIBLEINK achieves finer control at the logit processing and vocabulary selection levels, effectively reducing privacy budget consumption and balancing high utility with low computational cost. Beyond general noise injection, some research explores finer-grained defense methods. He et al. proposed the LPRAG (Locally Private RAG) framework [106], based on the concept of Local Differential Privacy (LDP). Instead of crudely processing entire text segments, it identifies private entities such as words, numbers, and phrases, applying exclusive perturbations to different entity types based on an adaptive budget allocation strategy. This precise entity-level protection avoids global information distortion, ensuring robust protection for highly sensitive information like nursing data. Furthermore, Yao and Li approached defense from the vector space, proposing a method based on Random Projection [107]. This method uses a Gaussian matrix to project query and document embeddings into a lower-dimensional space. While preserving the similarity required for retrieval, it perturbs the original embedding values, making it difficult for attackers to extract original information via embedding inversion. This method is applicable to KNN-LM [108] and direct prompt-based RAG architectures, achieving record-level differential privacy protection.
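The Gaussian random-projection idea can be sketched as follows. The dimensions and seed are arbitrary illustrative choices; the 1/sqrt(k) scaling is the usual Johnson-Lindenstrauss convention, under which pairwise similarities are approximately preserved while individual coordinates are scrambled:

```python
import numpy as np

def random_projection(embs: np.ndarray, out_dim: int, seed: int = 0) -> np.ndarray:
    """Project d-dimensional embeddings into out_dim dimensions with a
    shared Gaussian matrix. The geometry needed for similarity retrieval
    is approximately preserved, while the original coordinates are
    perturbed, hindering embedding inversion."""
    d = embs.shape[1]
    rng = np.random.default_rng(seed)
    proj = rng.normal(0.0, 1.0 / np.sqrt(out_dim), size=(d, out_dim))
    return embs @ proj
```

Queries and documents must be projected with the same matrix (here, the same seed) so that their relative geometry, and hence the retrieval ranking, survives the projection.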
Comparing these technologies horizontally, DP-RAG is suitable for centralized scenarios with high document redundancy, offering full-process guarantees; DPSparseVoteRAG and INVISIBLEINK excel in balancing utility for long-text generation; PAD is suitable for rapid hardening of existing systems due to its lightweight nature; while LPRAG and Random Projection provide finer solutions at the entity level and embedding layer, respectively.

4.2.2 Output-Protected RAG under Federated Learning Frameworks

Federated Learning (FL), as a distributed machine learning paradigm, offers the core advantage of collaborative training where data remains stationary while the model moves. In RAG output-side defense, FL provides a viable solution: by decoupling retrieval and generation processes, it ensures the centralized generation model cannot directly access local raw data, physically blocking the risk of the generator memorizing and accidentally outputting sensitive data. This mechanism transforms traditional centralized data leakage risks into controllable model parameter exchange risks, and thus emerges as a key technology for defending against membership inference attacks. Addressing architectural security in the inference output stage of Federated RAG, [109] constructed a comprehensive defense system aimed at eliminating privacy hazards in model outputs within highly sensitive fields like healthcare and finance. This study not only utilizes encryption protocols to protect model updates but, more critically, introduces Trusted Execution Environments (TEE) and Selective Query Routing during the inference phase. This design ensures that when the system responds to user requests, it invokes only necessary localized augmented knowledge, preventing the inference results from reverse-exposing the structure of source-side sensitive databases by minimizing data egress paths. To further reduce information leakage risks during inference, Addison et al. proposed the C-FedRAG system [110].
This system introduces Confidential Computing into the federated inference flow, designing an orchestrator running in an isolated environment. Under this architecture, document embedding retrieval is completed locally, while the aggregated inference responsible for final content generation is encapsulated within a confidential computing black box. This "retrieval-generation" separation mechanism ensures that the final output of the RAG system is generated from securely aggregated global knowledge, preventing attackers from reverse-reconstructing local private data of participants by analyzing system outputs. At the algorithmic level, to prevent data reconstruction caused by model parameter leakage, the FedE4RAG framework proposed by [111] combines Federated Knowledge Distillation (KD-GLE module) with Homomorphic Encryption (FED-HE module). Unlike direct gradient sharing, this framework utilizes "teacher models" produced by local trusted retrievers to guide the learning of a global "student model." This approach ensures the final user-facing generation model learns only the generalized global knowledge distribution without memorizing specific local sensitive samples, effectively defending against membership inference attacks targeting the output model while maintaining generation quality. As RAG technology evolves toward multi-agent collaboration, dynamic leakage during inference interaction becomes a new defense focus. The Federated Multi-Agent System (Federated MAS) proposed by [112] focuses on real-time output control during the inference stage. This research designed embedded privacy-enhancing agents acting as inspection mechanisms for RAG retrieval and context interaction.
When multiple agents collaborate on complex tasks, these agents can filter non-task-related redundant information in real-time, allowing only desensitized necessary information to be passed as output to other agents, thereby minimizing the data exposure surface during collaborative inference. In domain-specific application validation, Karamanlıoğlu et al. developed a clinical decision support system integrated with stream analytics [113]. This system utilizes federated RAG mechanisms to achieve inference isolation, combined with differential privacy technology, ensuring that diagnostic suggestions output by the system reflect only collective medical patterns while strictly stripping away individual patient characteristics. Such strict constraints at the output end enable the system to meet rigorous data egress requirements of regulations like GDPR and HIPAA. In summary, RAG protection technologies under federated learning frameworks essentially reconstruct the system's output boundaries through distributed architecture. Whether it is hardware-level inference isolation in C-FedRAG, algorithm-level knowledge distillation in FedE4RAG, or interaction-level dynamic filtering in Federated MAS, the core objective is to ensure that every output generated by the RAG system undergoes strict de-identification and aggregation processing. These technologies effectively build a solid defense line preventing raw data leakage through model outputs while ensuring knowledge augmentation effectiveness.

4.2.3 Data Sanitization via Lightweight Entity Masking and Inference-Time Privacy Isolation

Among the output-side protection technologies for RAG systems, data masking provides a lightweight and efficient reasoning isolation scheme to defend against membership inference attacks.
Unlike traditional input cleansing, the core value of this technology from the perspective of output protection lies in achieving the physical decoupling of logical generation and information filling. By mapping sensitive entities, such as names and medical records, to meaningless placeholders like <PERSON 1>, this technology compels the generator to reason exclusively at the semantic logic level without accessing or outputting actual sensitive values. This mechanism ensures that the raw response generated by the model remains fundamentally desensitized. Even if the model is manipulated by attackers or experiences hallucinations, it solely outputs secure placeholders, thereby isolating semantics from question-answering at the output end [114]. In practical protection implementation, the "Sanitization-Inference-Restoration" workflow proposed by [90] demonstrates how to defend against information leakage by controlling the output construction process:

1. Output Constraints during Inference: Before inputting prompts into the large model, they undergo masking processing where sensitive terms are replaced with meaningless masks, and sensitive information along with mask positions are saved in a bidirectional mapping table. The generative model is restricted to generating responses within a safe vocabulary space containing only placeholders. This implies that the intermediate results output by the model inherently contain no user privacy, thoroughly avoiding the risk of plaintext output theft by external model service providers or man-in-the-middle attacks.

2. Localized Output Reconstruction: The actual output process is shifted to a controlled local secure environment for execution. Based on the preset bidirectional mapping table, the system reverse-restores the response templates containing placeholders generated by the model.
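The mask-then-restore bookkeeping described above can be sketched in a few lines. The placeholder format and the plain string replacement are simplifying assumptions for illustration; a production system would detect entities with NER and use collision-safe tokens:

```python
def mask(text: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each sensitive entity with a placeholder and record the
    bidirectional mapping table, which never leaves the local client."""
    table: dict[str, str] = {}
    for i, ent in enumerate(entities, start=1):
        placeholder = f"<ENTITY {i}>"
        table[placeholder] = ent
        text = text.replace(ent, placeholder)
    return text, table

def unmask(response: str, table: dict[str, str]) -> str:
    """Local restoration step: substitute placeholders back before the
    response is shown to an authorized user."""
    for placeholder, ent in table.items():
        response = response.replace(placeholder, ent)
    return response
```

The remote LLM only ever sees and emits placeholder text; restoration happens entirely on the trusted edge, which is what confines leakage even if the model is compromised.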
This strategy limits the role of the Large Language Model (LLM) to an untrusted logical processing engine rather than the final publisher of information. By consolidating the synthesis authority of the final output locally, this technology effectively defends against eavesdropping and reverse deduction targeting the model output end, ensuring that complete sensitive information is presented to authorized users only after local security verification. Overall, research on RAG security is transitioning from early singular threat identification and vulnerability patching to the construction of systematic, comprehensive defense architectures with formal guarantees. As illustrated in Figure 8, the current ecosystem of protection technologies has become increasingly comprehensive. These technologies span from fundamental access control and data cleansing to mathematically grounded differential privacy and homomorphic encryption, further extending to federated learning for addressing data silos and lightweight data masking.
These technologies are not isolated but gradually form a complementary, defense-in-depth architecture. Encryption technologies ensure underlying confidentiality, while differential privacy and masking technologies balance the usability and privacy of upper-layer applications. This multi-layered, integrated defense strategy signifies that RAG security research is advancing towards a more mature and trustworthy phase.

4.3 Chapter Summary

In summary, defense technologies for RAG systems have constructed a defense-in-depth system covering data sources, inference pipelines, and output terminals. Addressing data poisoning threats, data cleaning and filtering mechanisms establish a trusted foundation for the knowledge base during the index construction phase through strict admission screening and anomaly detection. Facing complex adversarial attacks, robustness enhancement strategies integrate technologies such as input purification, model fine-tuning, and graph-based semantic re-ranking, significantly improving the system's logical error-correction capabilities under noise interference and counterfactual misleading. Regarding defense against membership inference attacks, privacy protection mechanisms effectively block leakage paths that utilize output distributions to reverse-engineer training data by employing differential privacy noise injection and adaptive decoding intervention. Overall, existing defense paradigms are evolving from early single-point static protection to full-lifecycle, semantic-aware, and model-agnostic dynamic collaboration, aiming to ensure that Retrieval-Augmented Generation technology achieves controllable security risks and firmly guarded privacy boundaries while maintaining high utility.
5 Security Evaluation Standards for RAG Technology

With the development of protection technologies for Retrieval-Augmented Generation (RAG), its security evaluation has advanced from early qualitative analysis toward standardized, multi-dimensional quantitative benchmarks. Early evaluations largely adopted adversarial testing metrics from Large Language Models (LLMs). However, since RAG architectures introduce external knowledge retrieval and context fusion mechanisms, traditional single-evaluation systems struggle to cover novel threats such as embedding inversion and data poisoning. In recent years, the academic community has gradually built comprehensive evaluation frameworks covering attack success rates, retrieval fairness, generation robustness, and defense effectiveness. From baseline defense tests targeting aligned models in 2023 to specialized benchmarks for RAG-specific attacks emerging in 2025, security evaluation standards are evolving toward finer granularity, scenario-specific adaptation, and systematization.

5.1 Multi-Dimensional Evaluation Metric System

Existing evaluation standards typically categorize security metrics into three dimensions: attack effectiveness, system utility, and detection/defense capability. Furthermore, higher-order metrics such as fairness and cognitive complexity are introduced based on research focus. Addressing fundamental adversarial attacks, [115] established Attack Success Rate (ASR) as the core metric for measuring the proportion of jailbreaks. They also introduced Perplexity and Windowed Perplexity to detect high-perplexity adversarial text. Additionally, this study emphasized the robustness-performance trade-off, asserting that enhancing defense capabilities should not excessively compromise the model's generation quality (evaluated via AlpacaEval win rates). In complex cognitive task scenarios unique to RAG, [116] introduced a fairness evaluation dimension.
Its core metrics include Expected Exposure (E-L), used to measure the deviation between the actual distribution of retrieval results and a target fair distribution (e.g., gender, geography); and Attribution Weight, which combines sentence centrality with paragraph-sentence entailment relationships to evaluate the degree of dependency of generated content on retrieval sources. Statistical tests such as the Wilcoxon Signed-Rank Test were also employed to quantify significant differences in fairness scores between the retrieval and generation ends. Targeting component-level vulnerabilities in RAG, [117] proposed subdivided security metrics within the SafeRAG benchmark. On the retrieval side, they defined Retrieval Accuracy (RA), aiming to balance the recall of normal context with the suppression of aggressive context. On the generation side, they proposed F1 variant metrics (including F1(correct) and F1(incorrect)) to distinguish generation precision between correct and incorrect options, combining this with Attack Failure Rate (AFR) as a positive security indicator to validate defense effectiveness. Facing large-scale poisoning threats, [118] constructed comprehensive detection and robustness metrics. Beyond conventional Accuracy (ACC) and Attack Success Rate (ASR), this study specifically introduced Detection Accuracy (DACC), False Positive Rate (FPR), and False Negative Rate (FNR) targeting defense mechanisms. Simultaneously, transferability metrics were used to evaluate the generality of attacks across different RAG architectures (e.g., Multi-modal, Agent RAG), thereby revealing the shortcomings of defense systems.

5.2 Evaluation Datasets

The construction of evaluation datasets has transitioned from general LLM security data to RAG-specific, structured constructed data.
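The detection-oriented metrics recurring in these benchmarks (DACC, FPR, FNR) reduce to confusion-matrix arithmetic over a labeled pool of benign and poisoned documents; a minimal illustrative sketch, with hypothetical counts:

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Confusion-matrix view of a poisoning detector: tp/fn count poisoned
    documents caught/missed, fp/tn count benign documents flagged/passed."""
    total = tp + fp + tn + fn
    return {
        "DACC": (tp + tn) / total,                   # overall detection accuracy
        "FPR": fp / (fp + tn) if fp + tn else 0.0,   # benign docs wrongly flagged
        "FNR": fn / (fn + tp) if fn + tp else 0.0,   # poisoned docs missed
    }
```

Reporting FPR alongside DACC matters because an aggressive filter can score high detection accuracy while discarding so much benign context that end-task utility collapses.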
Early research, such as [115], primarily relied on AdvBench (a set of harmful behavior prompts) and AlpacaEval (a general instruction set), focusing on assessing the model's intrinsic interference resistance. With increased attention on RAG fairness, [116] selected the TREC 2022 Fair Ranking Track corpus. Based on the Anderson-Krathwohl taxonomy, they designed 8 categories of cognitive templates, deriving 368 informational queries covering understanding, analysis, and creation dimensions to test system performance under complex cognitive tasks. Targeting RAG-specific attack vectors, [117] constructed the SafeRAG dataset, the first Chinese RAG security benchmark. Based on real news text, this dataset artificially constructed "Question-Context" pairs containing Silver Noise, inter-context conflict, soft advertisements, and white DoS attacks. It covers 5 sensitive fields including politics and finance, filling the data gap for specific bypass techniques. To test the effectiveness of defense mechanisms in large-scale knowledge base environments, [118] integrated 5 standard QA datasets, such as NQ and HotpotQA, into the RSB benchmark and constructed extended versions (EX-M, EX-L). By injecting semantically similar correct answer texts or distractors into the original datasets, they simulated the massive noise and poisoning threats faced in real retrieval environments, making the evaluation environment closer to actual deployment scenarios.
Table 5: Comparison of RAG Evaluation Datasets

| Dataset | Application Scopes | Target RAG Modules | Size | Year | Limitations |
| HotpotQA | Complex multi-hop logical reasoning QA tasks | Multi-hop retrieval, context fusion | 113K QA pairs (10 topics) | 2018 | Lacks security attack designs; solely evaluates foundational capabilities |
| NQ | Single-hop knowledge QA in real-world scenarios | Retriever, generator | 307K training, 15K test pairs | 2019 | Limited to single-hop tasks; no security evaluation |
| TREC 2022 Fair | 8 cognitive templates for bias verification | Retriever | 6.47M+ docs, 48 queries | 2022 | No active attack evaluations; no security threat coverage |
| AdvBench | 32 harmful behavior-inducing attacks | Generator | 520 harmful instruction samples | 2023 | No RAG-specific attack designs; weak adaptability |
| AlpacaEval | 10+ general instruction tasks | Generation quality | 529 general instruction samples | 2023 | No security evaluation; no RAG-specific support |
| SafeRAG | 4 RAG-specific injection attacks | Full pipeline | 100 test pairs (5 domains) | 2025 | Small sample size; Chinese-only support |

Fig. 11: (a) Comparative metrics (ACC, ASR, F1-score) of 13 attack methods (BPRAG, WPRAG, BPI, WPI, AGGD, CRAG-AS, CRAG-AK, JamInject, JamOracle, JamOpt, AP, BadRAG, Phantom) evaluated on the NQ dataset. (b) Impact of InstructRAG and TrustRAG defenses on maintaining ACC against various attacks (NQ dataset).

To demonstrate the practical implications of testing in these realistic environments (e.g., the NQ dataset), a comprehensive empirical analysis of the attack-defense dynamic is required. Figure 11 (a) quantitatively highlights the inherent fragility of undefended RAG architectures.
The sharp contrast between suppressed Accuracy (ACC) and inflated Attack Success Rates (ASR) across the majority of the 13 evaluated methods underscores the lethality of modern poisoning and DoS techniques. In response to these severe vulnerabilities, Figure 11 (b) maps the recovery trajectories of the system under different protective paradigms. The data clearly shows that while baseline systems collapse under pressure, defense integration is crucial. Specifically, the TrustRAG framework emerges as a superior solution, significantly outperforming InstructRAG by neutralizing the variance between different attack vectors and maintaining a universally stable ACC baseline.

5.3 Chapter Summary

In summary, security evaluation standards for RAG technology have formed a relatively complete system. At the metric level, there has been a multi-dimensional expansion from single Attack Success Rate to covering fairness, attribution weight, and component-level robustness. At the data level, evaluation resources have evolved from general security prompt sets to specialized datasets containing specific attack features (e.g., cognitive bias, logical conflict, large-scale noise). [115] established the baseline for adversarial defense; [116] and [117] revealed fairness degradation and component-level vulnerabilities, respectively; while [118] further validated the limitations of existing defenses under complex architectures and large-scale data. Future evaluation standards will tend more towards automation and dynamism to cope with novel security challenges brought by multi-modal fusion and Agent-based RAG systems.

6 Summary and Future Outlook

To clarify the developmental trajectory and overall landscape of RAG security, we systematically distill the core theoretical insights of current research, the critical challenges in practical deployment, and the primary directions for future evolution, as summarized in Table 6.
This section provides an in-depth discussion based on this table. We first transcend the limitations of singular technologies to analyze the underlying logic and intrinsic flaws that contribute to the vulnerability of RAG systems. Subsequently, we address the practical barriers encountered by current defense architectures as they evolve toward dynamic and complex scenarios. Finally, from a macroscopic perspective, we propose a forward-looking technical blueprint to resolve the zero-sum trade-off between privacy and efficiency, aiming to construct a fully verifiable next-generation RAG architecture.

6.1 Insight

Following a deep investigation into specific defense technologies for RAG, we analyze the endogenous roots of its vulnerabilities from a theoretical perspective. The RAG architecture is not merely a stacking of components; by introducing external evidence to anchor the generation process, it triggers a profound Trust Paradox. To enhance utility, the system is forced to develop a structural dependency on externally retrieved content, and this dependency serves as the very lever exploited by attackers. Research indicates that even models subject to strict safety alignment tend to suppress their internal safety priors and credit external data when facing malicious or misleading context [119]. This phenomenon is termed the Contextual Security Gap, shifting the center of gravity for security from model parameters to the retrieval interface. For instance, attacks such as CorruptRAG and PoisonedRAG demonstrate that injecting a minimal amount of malicious samples into a massive corpus can induce the model to generate predefined false answers or malicious instructions with a success rate often exceeding 90% [120]. This reveals the essential characteristic that RAG security is constrained by the semantic integrity of the retrieval corpus.

Table 6: Insights, Challenges, and Prospects
Insight I: Trust Paradox and the Contextual Security Gap shift the security focus. The structural dependence of RAG on external retrieval causes the model to suppress internal security priors in the presence of malicious contexts (i.e., the Contextual Security Gap). This trust mechanism flaw allows minimal poisoning samples to hijack the generation process via the retrieval interface, revealing that system security is essentially bound by context integrity.

Insight II: Semantic Gap renders embedding space attacks an intrinsic defect. A fundamental inconsistency exists between human natural language logic and machine dense vector representations. Attackers exploit the statistical limitations of cosine similarity to create "adversarial collisions." Without a symbolic logic verification layer, relying solely on vector retrieval cannot eradicate covert attacks targeting the embedding space.

Insight III: Data flow is trapped in a Privacy-Efficiency zero-sum game. In existing solutions, Fully Homomorphic Encryption (FHE) is inefficient, TEEs face side-channel risks, while plaintext retrieval turns RAG into a "data leakage amplifier." Attackers can steal sensitive records via inference side-channels (e.g., Logits analysis) or embedding inversion, making it difficult to balance real-time performance with confidentiality.

Challenge I: Runtime encrypted data streams face dual bottlenecks of efficiency and side channels. Static encryption cannot protect high-frequency runtime vector operations. Searchable Encryption (SE) faces a conflict between computational overhead and accuracy in high-dimensional retrieval, with complex dynamic index maintenance. Furthermore, pure encrypted retrieval fails to mask access patterns, enabling attackers to infer query intent or knowledge base distribution via traffic statistics.

Challenge II: The Agentic evolution introduces risks of physical cascading failures.
As RAG transitions toward active task execution, retrieval poisoning can translate into physical malicious operations. Facing network-wide "delusions" caused by cross-agent infection and uncontrolled tool invocation, traditional defenses based on text matching struggle to semantically block malicious API calls disguised as normal logic.

Challenge III: Evaluation benchmarks are lagging and constrained by multi-dimensional metric coupling. Existing evaluations lack end-to-end standards, with security and efficacy metrics highly coupled. Static datasets fail to capture black-box ranking attacks or adaptive stealthy poisoning, while automated testing relying on "LLM-as-a-Judge" suffers from bias and hallucinations, making it difficult to quantify the defense boundaries against unknown threats.

Prospect I: Constructing an efficient confidential computing architecture to break the zero-sum game. The future lies in developing dedicated Privacy-Enhancing Technologies (PETs), such as balancing utility and compliance via sparse Differential Privacy, and designing hybrid encryption protocols (SE+HE) for efficient ciphertext Approximate Nearest Neighbor (ANN) search, thereby resolving runtime privacy leakage while ensuring real-time performance.

Prospect II: Establishing a standard for White-box RAG (SAG) based on cryptographic verification. Systems will shift from black-box trust to provable security. By isolating poisoning via full-link encryption, utilizing recursive SNARKs for practical Zero-Knowledge data provenance, and integrating Verifiable Vector Retrieval technologies, the integrity and immutability of retrieval results can be mathematically guaranteed, building a trustless secure interaction.

Prospect III: Shifting from content security to dynamic governance of agent behavior.
With the rise of Agentic RAG, defense focus must shift to "Behavioral Security." Utilizing RLHF for Security to constrain tool invocation and establishing dynamic runtime sandboxes to isolate side effects are essential to prevent hijacked agents from autonomously executing destructive operations within enterprise networks.

Beyond the shift in trust mechanisms, another major endogenous defect of RAG systems stems from the Semantic Gap: the fundamental inconsistency between natural language logic understood by humans and dense vector representations processed by machines. Existing vector databases rely on statistical features like cosine similarity for retrieval, failing to capture true logical entailment. Attackers exploit this gap to craft adversarial samples: a string of text that appears as gibberish or benign to humans can, after encoding by an embedding model, possess a vector that perfectly overlaps with high-confidentiality data or specific trigger instructions in the embedding space. As long as RAG systems overly rely on singular dense vector representations without a symbolic logical verification layer, embedding inversion attacks remain an ineradicable theoretical bottleneck [121]. Furthermore, at the data flow level, current RAG solutions are trapped in a severe "Privacy-Efficiency" Zero-Sum Game. Fully Homomorphic Encryption (FHE), while theoretically secure, incurs immense computational overhead, making it unsuitable for real-time interaction demands; Trusted Execution Environment (TEE) solutions are faster but are limited by hardware roots of trust and face side-channel risks. Lacking strong privacy protection, RAG systems effectively act as efficient "Data Leakage Amplifiers." Since Top-k documents are directly input as context, attackers can induce outputs via prompt injection or analyze logits to infer sensitive records within the database with high precision [122].
Simultaneously, once vector index access is compromised, the risk of recovering original text using embedding inversion techniques renders the security of the vector database itself a weak link in the entire chain [38]. Based on these underlying insights, although RAG security research has made progress in local defenses, it still faces severe challenges in the face of Agentic trends and zero-trust demands. Future research must move beyond "patch-style defense" and evolve toward full-chain verifiability and behavioral security governance.

6.2 Challenges

6.2.1 Efficiency Bottlenecks and Privacy Risks of Runtime Encrypted Data Flows

Current RAG encryption defenses are mostly limited to the static storage phase, which prevents direct theft [93] but fails to protect runtime data flows. The core of RAG relies on high-frequency operations within vector databases; however, performing similarity retrieval on embedding vectors in ciphertext faces severe challenges. First is the contradiction between computational efficiency and retrieval precision: Searchable Encryption (SE) schemes typically incur high computational and communication overheads, making it difficult to process high-dimensional vector retrieval without sacrificing RAG real-time performance [123]. Second is the complexity of index maintenance: the dynamic updates of RAG knowledge bases require encrypted indexes to support efficient addition, deletion, and modification while ensuring both forward and backward secrecy. Finally, there are deep privacy leakage risks: embedding vectors themselves may imply plaintext semantics [41], and simple encrypted retrieval cannot mask access patterns, allowing attackers to implement Membership Inference Attacks via statistical inference.

6.2.2 Compound Risks and Cascading Failures in Agentic Systems

As RAG evolves from passive QA to task-executing Agentic RAG, security risks escalate qualitatively.
Retrieval poisoning no longer leads merely to text errors but can translate into malicious operations in the physical world. This shift introduces cross-agent infection risks, where a poisoned agent may propagate erroneous information to the entire agent network via a shared memory pool, causing "systemic delusion" [124]. Simultaneously, the uncontrollability of tool usage intensifies; text-matching filters fail to judge the semantic legitimacy of API calls (e.g., modifying code, sending emails), especially when malicious instructions disguise themselves as normal business logic [125]. Current static defenses lack active mechanisms like AgenticRed [126] that self-evolve to block complex attack chains in real-time, leaving systems exposed to the threat of cascading failure.

6.2.3 Fragmentation of Evaluation Benchmarks and Lag in Adversarial Evolution

Current RAG security assessment lacks end-to-end standardized benchmarks, making it difficult to fairly quantify the effectiveness of defense technologies. First is the complexity of evaluation tool design: the multi-modular nature of RAG couples security highly with metrics like retrieval quality and generation faithfulness, and existing evaluation frameworks still require refinement in numerical derivation details [127]. Second is the diversity and stealthiness of attacks: as novel attacks designed to bypass detection, such as stealthy poisoning, constantly emerge [57], benchmarks based on static datasets are gradually becoming ineffective. Furthermore, the ability to detect unknown threats in black-box attack scenarios is difficult to measure effectively using existing fragmented tests.

6.3 Outlook

6.3.1 Constructing Efficient Full-Chain Confidential Computing Architectures

To break the privacy-efficiency zero-sum game, future research must develop high-performance privacy-enhancing technologies tailored for RAG.
On one hand, the application of sparse differential privacy algorithms, such as DPSparseVoteRAG, should be promoted. By sparsifying vectors before noise injection, these methods retain key semantic features while meeting strict privacy budgets, achieving a balance between privacy and utility [103]. On the other hand, research should focus on exploring hybrid protocols combining efficient Searchable Encryption and Homomorphic Encryption, designing index structures specifically for approximate nearest neighbor search in ciphertext vector spaces to reduce computational complexity while ensuring retrieval precision.

6.3.2 Moving Toward Cryptographically Verifiable White-Box RAG Architectures

Future RAG systems will transcend trust-based black-box models to establish Provably Secure RAG standards. Research focuses should center on full-chain encryption and isolation, adopting pre-stored fully encrypted schemes to ensure retrieval content remains invisible during storage, transmission, and computation, fundamentally blocking poisoning based on database access [128]. Simultaneously, practical Zero-Knowledge Provenance should be achieved via algorithmic optimization (e.g., Recursive SNARKs), providing concise cryptographic proofs for every generation to certify that the response originates from an authenticated and untampered dataset [129]. Additionally, integrating verifiable vector retrieval technologies will enable users to mathematically verify the accuracy of Top-k results, eliminating retrieval hijacking risks and constructing secure interaction mechanisms without trust assumptions.

6.3.3 Behavioral Security Governance for Agentic RAG

Future RAG will no longer be a simple "retrieval-generation" pipeline but will evolve into Agentic RAG. Agents will not only retrieve information but also call tools, execute code, and even modify databases.
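Such tool-calling agents motivate the runtime behavioral controls discussed next. As a minimal sketch, a deny-by-default policy gate over agent tool calls might look like the following (all names, including POLICY and execute_tool, are hypothetical, not from the paper or any cited system):

```python
# Deny-by-default policy gate for agent tool calls (illustrative sketch).

POLICY = {
    "search_docs": {},                          # read-only retrieval: allowed
    "send_email":  {"requires_approval": True}, # allowed only with human sign-off
    # destructive tools (e.g. database mutation) are simply absent,
    # so they are denied by default
}

def execute_tool(name: str, approved: bool = False) -> str:
    rule = POLICY.get(name)
    if rule is None:
        raise PermissionError(f"tool '{name}' denied by default")
    if rule.get("requires_approval") and not approved:
        raise PermissionError(f"tool '{name}' needs human approval")
    return f"ran {name}"

assert execute_tool("search_docs") == "ran search_docs"
assert execute_tool("send_email", approved=True) == "ran send_email"

# A poisoned agent's attempts are blocked regardless of how its
# instructions are phrased, since the gate judges the call, not the text:
for attempt in ("drop_database", "send_email"):
    try:
        execute_tool(attempt)
        raise AssertionError("policy gate failed to block")
    except PermissionError:
        pass  # blocked as intended
```

A real deployment would pair such a gate with sandboxed execution so that even approved tools cannot produce unbounded side effects.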
Security research will shift from singular "Content Security" to "Behavioral Security." Preventing poisoned RAG Agents from autonomously executing malicious operations within enterprise intranets (e.g., database deletion, phishing) represents a new research high ground. Future technical directions include developing Reinforcement Learning-based safety alignment algorithms specifically to constrain Agent tool-use behaviors, as well as establishing dynamic runtime sandboxes to isolate Agent side effects [130].

7 Conclusion

Retrieval-Augmented Generation (RAG) systems, through the synergy of retrieval and generation modules, effectively expand the knowledge boundaries of Large Language Models (LLMs) and improve the accuracy and timeliness of outputs. They have demonstrated extensive application potential in critical fields such as healthcare, education, and smart contracts. However, their multi-modular architecture and dependency on external knowledge introduce unique security risks distinct from traditional large models. Consequently, building a comprehensive security protection system has become a core prerequisite for the large-scale implementation of RAG technology.

This survey has systematically reviewed the current status of RAG security research. Starting from the architectural foundation, it comprehensively analyzed prevailing security threats: data poisoning attacks tamper with system outputs by injecting malicious data; membership inference attacks threaten the privacy of knowledge bases; adversarial attacks manipulate retrieval and generation processes via stealthy perturbations; and other risks such as embedding inversion and indirect prompt injection further exacerbate system vulnerabilities. These threats impact the reliability, integrity, and confidentiality of RAG systems from multiple dimensions, including data, models, and privacy.
Addressing these security challenges, existing defense technologies have coalesced into two core directions: data-and-model security, and privacy protection. Methods such as data cleaning, encryption protection, and access control have solidified the security perimeter for data and models, while technologies like federated learning and anonymization provide effective support for privacy protection. These defense strategies offer viable pathways for mitigating currently known security risks. Simultaneously, the field has preliminarily established experimental standards and evaluation benchmarks based on specific attack types, providing crucial tools for the quantitative analysis of attack and defense effectiveness.

Finally, regarding the challenges and threats facing RAG security research, this survey provided a future outlook. It highlighted that RAG security research still faces numerous open challenges: insufficient real-time defense capabilities during system runtime, a lack of formal security proofs for existing defense technologies, and limited coverage and adaptability of benchmark tests. Future research should focus on advancing dynamic encryption defense, provably secure RAG, and comprehensive standard benchmarking. Through technical innovation and system perfection, the goal is to construct RAG systems that possess security, efficiency, and practicality. This survey aims to serve as a reference for research dedicated to promoting the security and reliability of RAG systems in their further practical development.

References

[1] Vaswani, A.: Attention is all you need. Advances in Neural Information Processing Systems (2017)
[2] Devlin, J.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
[3] Radford, A., Narasimhan, K.: Improving language understanding by generative pre-training.
Available: https://api.semanticscholar.org/CorpusID:49313245 (2018)
[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
[5] Guo, D., Yang, D., et al.: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948 (2025). https://doi.org/10.48550/arXiv.2501.12948. http://arxiv.org/abs/2501.12948
[6] Thirunavukarasu, A.J., Ting, D.S.J., Elangovan, K., Gutierrez, L., Tan, T.F., Ting, D.S.W.: Large language models in medicine. Nature Medicine 29(8), 1930–1940 (2023)
[7] Boiko, D.A., MacKnight, R., Kline, B., Gomes, G.: Autonomous chemical research with large language models. Nature 624(7992), 570–578 (2023)
[8] Jiang, G., Ma, Z., Zhang, L., Chen, J.: Eplus-llm: A large language model-based computing platform for automated building energy modeling. Applied Energy 367, 123431 (2024)
[9] Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., Zeng, A.: Code as policies: Language model programs for embodied control. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), p. 9493–9500. IEEE (2023)
[10] Webb, T., Holyoak, K.J., Lu, H.: Emergent analogical reasoning in large language models. Nature Human Behaviour 7(9), 1526–1541 (2023)
[11] Wei, J., Wei, J., Tay, Y., Tran, D., Webson, A., Lu, Y., Chen, X., Liu, H., et al.: Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846 (2023)
[12] Huang, J., Chang, K.C.-C.: Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403 (2022)
[13] Huang, L., Yu, W., Ma, W., et al.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.
ACM Transactions on Information Systems 43(2), 1–55 (2025)
[14] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive nlp tasks. In: Advances in Neural Information Processing Systems, vol. 33, p. 9459–9474. Curran Associates, Inc. (2020). Accessed: 2024-12-11. https://proceedings.neurips.c/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
[15] Zhao, T., Chen, J., Ru, Y., et al.: RAG Safety: Exploring Knowledge Poisoning Attacks to Retrieval-Augmented Generation (2025)
[16] Wang, C., Li, H., Song, W., et al.: Retrieval-augmented generation: A survey of security challenges and countermeasures. In: 2025 11th IEEE International Conference on Privacy Computing and Data Security (PCDS), p. 210–217 (2025)
[17] Oche, A.J., Folashade, A.G., Ghosal, T., et al.: A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions (2025)
[18] Edge, D., Trinh, H., Cheng, N., et al.: From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130. Accessed: 2026-01-23 (2025). http://arxiv.org/abs/2404.16130
[19] Zhao, J., Liu, X.: Tcaf: A multi-agent approach of thought chain for retrieval augmented generation. arXiv preprint (2024)
[20] Quynh Nhu, N.D., Minh Quan, L., Van, T.H., et al.: Rag-smartvuln: Enhancing smart contract vulnerability detection via retrieval-augmented llms. In: 2025 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), p. 1–6 (2025)
[21] Husain, M.L., Wibisono, Y., Anisyah, A.: Development of an academic services chatbot based on retrieval-augmented generation (rag). Brilliance: Research of Artificial Intelligence 5(2), 727–735 (2025)
[22] Abo El-Enen, M., Saad, S., Nazmy, T.: A survey on retrieval-augmentation generation (rag) models for healthcare applications.
Neural Computing and Applications (2025)
[23] Dobreva, J., Karasmanakis, I., Ivanisevic, F., et al.: Ragcare-qa: A benchmark dataset for evaluating retrieval-augmented generation pipelines in theoretical medical knowledge. Data in Brief 63, 112146 (2025)
[24] An, B., Zhang, S., Dredze, M.: RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models (2025)
[25] Wang, L., Zhu, T., Qin, L., et al.: Bias Amplification in RAG: Poisoning Knowledge Retrieval to Steer LLMs (2025)
[26] Anderson, M., Amit, G., Goldsteen, A.: Is my data in your retrieval database? Membership inference attacks against retrieval augmented generation. arXiv preprint (2024)
[27] Chen, Z., Liu, J., Gong, Y., et al.: FlippedRAG: Black-Box Opinion Manipulation Adversarial Attacks to Retrieval-Augmented Generation Models (2025)
[28] Panda, M., Mukherjee, S.: Advancing privacy and security in generative ai-driven rag architectures: A next-generation framework. International Journal of Artificial Intelligence & Applications 16(2), 15–24 (2025)
[29] Ammann, L., Ott, S., Landolt, C.R., et al.: Securing RAG: A Risk Assessment and Mitigation Framework (2025)
[30] Gummadi, V., Udayaraju, P., Sarabu, V.R., et al.: Enhancing communication and data transmission security in rag using large language models. In: 2024 4th International Conference on Sustainable Expert Systems (ICSES), p. 612–617 (2024)
[31] Wang, X., Wang, Z., Gao, X., et al.: Searching for best practices in retrieval-augmented generation. arXiv preprint arXiv:2407.01219. Accessed: 2026-01-28 (2024). http://arxiv.org/abs/2407.01219
[32] Sawarkar, K., Mangal, A., Solanki, S.R.: Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers. In: 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), p. 155–161 (2024). Accessed: 2026-01-28.
http://arxiv.org/abs/2404.07220
[33] Zhang, N., Choubey, P.K., Fabbri, A., et al.: SiReRAG: Indexing similar and related information for multihop reasoning. arXiv preprint arXiv:2412.06206. Accessed: 2026-01-28 (2025). http://arxiv.org/abs/2412.06206
[34] Sarthi, P., Abdullah, S., Tuli, A., et al.: RAPTOR: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059. Accessed: 2026-01-28 (2024). http://arxiv.org/abs/2401.18059
[35] Faysse, M., Sibille, H., Wu, T., et al.: ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449. Accessed: 2026-01-28 (2025). http://arxiv.org/abs/2407.01449
[36] Asai, A., Wu, Z., Wang, Y., et al.: Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511. Accessed: 2026-01-28 (2023). http://arxiv.org/abs/2310.11511
[37] Tran, H., Yao, Z., Wang, J., et al.: RARE: Retrieval-augmented reasoning enhancement for large language models. arXiv preprint arXiv:2412.02830. Accessed: 2026-01-28 (2025). http://arxiv.org/abs/2412.02830
[38] Zou, W., Geng, R., Wang, B., et al.: PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models. In: USENIX Security (2025)
[39] OWASP: OWASP top 10 for large language model applications (2025). Accessed: 2025-01-06. https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025/
[40] Cho, S., Jeong, S., Seo, J., et al.: Typos that broke the rag's back: Genetic attack on rag pipeline by simulating documents in the wild via low-level perturbations. In: Findings of the Association for Computational Linguistics: EMNLP 2024, p. 2826–2844. Association for Computational Linguistics, Miami, Florida, USA (2024)
[41] Morris, J.X., Kuleshov, V., Shmatikov, V., et al.: Text embeddings reveal (almost) as much as text. arXiv preprint arXiv:2310.06816. Accessed: 2025-01-06 (2023).
https://arxiv.org/abs/2310.06816
[42] Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: Proceedings of IEEE Symposium on Security and Privacy (SP), p. 3–18. IEEE (2017)
[43] Jiao, Y., Wang, X., Yang, K.: PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization (2025)
[44] Song, H., Liu, Y., Zhang, R., et al.: Chain-of-Thought Poisoning Attacks against R1-based Retrieval-Augmented Generation Systems (2025)
[45] Zuo, K., Liu, Z., Dutt, R., et al.: How to make Medical AI Systems safer? Simulating Vulnerabilities, and Threats in Multimodal Medical RAG System (2025)
[46] Cheng, P., Ding, Y., Ju, T., et al.: TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models (2024)
[47] Xue, J.Q., Zheng, M.X., Hu, Y.B., et al.: BadRAG: Identifying vulnerabilities in retrieval augmented generation of large language models. arXiv preprint arXiv:2406.00083 (2024). https://arxiv.org/abs/2406.00083
[48] Gonen, H., Iyer, S., Blevins, T., et al.: Demystifying Prompts in Language Models via Perplexity Estimation (2022)
[49] Alon, G., Kamfonas, M.: Detecting Language Model Attacks with Perplexity (2023)
[50] Ma, Z., Huang, X., Wang, Z., Qin, Z., Wang, X., Ma, J.: Fedghost: Data-free model poisoning enhancement in federated learning.
IEEE Transactions on Information Forensics and Security 20, 2096–2108 (2025). https://doi.org/10.1109/TIFS.2025.3539087
[51] Lin, Z., Cui, J., Liao, X., et al.: Malla: Demystifying Real-world Large Language Model Integrated Malicious Services (2024)
[52] Yu, L., Zhang, Y., Zhou, Z., et al.: Spa-VLM: Stealthy Poisoning Attacks on RAG-based VLM (2025)
[53] Wang, H., Zhang, R., Wang, J., et al.: Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems (2025)
[54] Peng, Y., Wang, J., Yu, H., et al.: Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors (2024)
[55] Zhang, B., Chen, Y., Liu, Z., et al.: Practical Poisoning Attacks against Retrieval-Augmented Generation (2025)
[56] Chaudhari, H., Severi, G., Abascal, J., et al.: Phantom: General trigger attacks on retrieval augmented language generation. arXiv preprint arXiv:2405.20485 (2024). https://arxiv.org/abs/2405.20485
[57] Li, C., Zhang, J., Cheng, A., et al.: Cpa-rag: Covert poisoning attacks on retrieval-augmented generation in large language models. arXiv preprint arXiv:2505.19864 (2025)
[58] Choudhary, S., Palumbo, N., Hooda, A., et al.: Through the Stealth Lens: Rethinking Attacks and Defenses in RAG (2025)
[59] Salem, A., Zhang, Y., Humbert, M., Berrang, P., Fritz, M., Backes, M.: Ml-leaks: model and data independent membership inference attacks and defenses on machine learning models. In: Proceedings of Network and Distributed Systems Security Symposium (2019)
[60] Song, L., Mittal, P.: Systematic evaluation of privacy risks of machine learning models. In: Proceedings of the 30th USENIX Security Symposium, p. 2615–2632. USENIX Association (2021)
[61] Sablayrolles, A., Douze, M., Schmid, C., Jégou, H.: White-box vs black-box: Bayes optimal strategies for membership inference. In: Proceedings of International Conference on Machine Learning, p. 5558–5567. PMLR
(2019)
[62] Wang, X., Zhao, Y., Zhang, J., et al.: Label-only membership inference attack against federated distillation. In: Proceedings of International Conference on Algorithms and Architectures for Parallel Processing, p. 394–410. Springer (2023)
[63] Nasr, M., Shokri, R., Houmansadr, A.: Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning. In: Proceedings of IEEE Symposium on Security and Privacy, p. 739–753. IEEE (2019)
[64] Wen, R., Li, Z., Backes, M., Zhang, Y.: Membership inference attacks against in-context learning. arXiv preprint arXiv:2409.01380. Accessed: 2025-12-27 (2024). https://arxiv.org/abs/2409.01380
[65] Carlini, N., Ippolito, D., Jagielski, M., et al.: Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646. Accessed: 2025-12-27 (2022). https://arxiv.org/abs/2202.07646
[66] Mattern, J., Mireshghallah, F., Jin, Z., et al.: Membership inference attacks against language models via neighbourhood comparison. arXiv preprint arXiv:2305.18462. Accessed: 2025-12-27 (2023). https://arxiv.org/abs/2305.18462
[67] Mireshghallah, F., Goyal, K., Uniyal, A., et al.: Quantifying privacy risks of masked language models using membership inference attacks. arXiv preprint arXiv:2203.03929. Accessed: 2025-12-27 (2022). https://arxiv.org/abs/2203.03929
[68] Satvaty, A., Verberne, S., Turkmen, F.: EL-MIA: Quantifying Membership Inference Risks of Sensitive Entities in LLMs (2025)
[69] Gao, X., Meng, X., Dong, Y., et al.: DCMIA: Differential Calibration Membership Inference Attack Against Retrieval-Augmented Generation (2025)
[70] Wang, G., He, J., Li, H., et al.: Rag-leaks: Difficulty-calibrated membership inference attacks on retrieval-augmented generation.
Science China Information Sciences 68(6) (2025)
[71] Choi, Y., Park, Y., Byun, J., et al.: Safeguarding Privacy of Retrieval Data against Membership Inference Attacks: Is This Query Too Close to Home? (2025)
[72] Liu, M., Zhang, S., Long, C.: Mask-based membership inference attacks for retrieval-augmented generation. In: Proceedings of the ACM on Web Conference 2025, p. 2894–2907. Association for Computing Machinery (2025)
[73] Li, Y., Liu, G., Wang, C., et al.: Generating Is Believing: Membership Inference Attacks against Retrieval-Augmented Generation (2024)
[74] Hu, Z.B., Wang, C., Shu, Y.F., et al.: Prompt perturbation in retrieval-augmented generation based large language models. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, p. 1119–1130. Association for Computing Machinery, Barcelona, Spain (2024)
[75] Liu, Y.A., Zhang, R.Q., Guo, J.F., et al.: Multigranular adversarial attacks against black-box neural ranking models. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, p. 1391–1400. Association for Computing Machinery, New York, NY, USA (2024)
[76] Wu, C., Zhang, R.Q., Guo, J.F., et al.: Prada: Practical black-box adversarial attacks against neural ranking models. ACM Trans. Inf. Syst. 41(4) (2023)
[77] Liu, J.W., Kang, Y.Y., Tang, D., et al.: Order-disorder: Imitation adversarial attacks for black-box neural ranking models. In: Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. Association for Computing Machinery, New York, NY, USA (2022)
[78] Zhang, Y.C., Li, Q.F., Du, T.Y., et al.: HijackRAG: Hijacking attacks against retrieval-augmented large language models. arXiv preprint arXiv:2410.22832 (2024).
https://arxiv.org/abs/2410.22832
[79] Chen, Z., Liu, J., Liu, H., et al.: Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models (2024)
[80] Song, H., Liu, Y., Zhang, R., et al.: The silent saboteur: Imperceptible adversarial attacks against black-box retrieval-augmented generation systems. In: Findings of the Association for Computational Linguistics (2025)
[81] Gong, Y., Chen, Z., Chen, M., et al.: Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models (2025)
[82] Pan, D., Li, S., Yang, H., et al.: Understanding the privacy risks of text embeddings: An inversion attack perspective. In: Proceedings of the 2020 International Conference on Machine Learning, p. 3452–3461. PMLR (2020)
[83] He, J., Graham, C., Kakade, S., et al.: Abandoning any-to-any: A case study in text embedding inversion. arXiv preprint arXiv:2305.13112 (2023)
[84] Greshake, K., Abdelnabi, S., Mishra, S., et al.: Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In: Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, p. 79–90. ACM, Copenhagen, Denmark (2023)
[85] Liu, Y., Deng, G., Li, Y., et al.: Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499 (2023)
[86] Deng, G., Liu, Y., Wang, K., et al.: Pandora: Jailbreak gpts by retrieval augmented generation poisoning. arXiv preprint arXiv:2402.08416. Accessed: 2025-01-06 (2024). https://arxiv.org/abs/2402.08416
[87] Zhuo, T.Y., Pan, Y., Chen, Q., et al.: Shadow alignment: The safety perils of retrieval-augmented generation. arXiv preprint arXiv:2402.14447 (2024)
[88] Wang, Z., Liu, J., Zhang, S., et al.: Poisoned langchain: Jailbreak llms by langchain. arXiv preprint arXiv:2406.18122. Accessed: 2025-01-06 (2024).
https://arxiv.org/abs/2406.18122
[89] Yalamuru, S., Chen, H., Wang, S., et al.: Implicit prompt injection attacks in retrieval-augmented generation systems. arXiv preprint arXiv:2312.14132 (2023)
[90] Haridasan, P.K.: The salesforce einstein trust layer for retrieval-augmented generation (rag) for enterprise applications. International Journal of Scientific Research in Engineering and Management 08(10), 1–3 (2024)
[91] Singh, N.N.: Context-aware access control in saas environments: A metric-driven framework. Journal of Information Systems Engineering and Management 10(58s), 1052–1060 (2025)
[92] Anonymous: SecureRAG: End-to-end secure Retrieval-Augmented Generation. Accessed: 2025-11-11 (2025). https://openreview.net/forum?id=5uXACIHz6K
[93] Zhou, P., Feng, Y., Yang, Z.: Privacy-Aware RAG: Secure and Isolated Knowledge Retrieval (2025)
[94] Bae, Y., Kim, M., Lee, J., et al.: Privacy-preserving LLM interaction with socratic chain-of-thought reasoning and homomorphically encrypted vector databases. arXiv preprint arXiv:2506.17336. Accessed: 2025-11-11 (2025). http://arxiv.org/abs/2506.17336
[95] He, J., Liu, C., Hou, G., et al.: Press: Defending privacy in retrieval-augmented generation via embedding space shifting. In: Proc of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p. 15. IEEE Press, Piscataway, NJ (2025)
[96] Jain, N., Schwarzschild, A., Wen, Y., et al.: Baseline defenses for adversarial attacks against aligned language models. Accessed: 2025-12-29 (2023). https://arxiv.org/
[97] Zhang, H., Gu, Z., Tan, H., et al.: Masking and purifying inputs for blocking textual adversarial attacks.
Information Sciences 648, 119501 (2023)
[98] Pathmanathan, P., Panaitescu-Liess, M.-A., Chiang, C.-Y.J., et al.: RAGPart & RAGMask: Retrieval-Stage Defenses Against Corpus Poisoning in Retrieval-Augmented Generation (2025)
[99] Zheng, J., Gema, A.P., Hong, G., et al.: GRADA: Graph-based Reranking against Adversarial Documents Attack (2025)
[100] Tu, Y., Su, W., Zhou, Y., et al.: RbFT: Robust Fine-tuning for Retrieval-Augmented Generation against Retrieval Defects (2025)
[101] Ward, C.M., Harguess, J.: Adversarial Threat Vectors and Risk Mitigation for Retrieval-Augmented Generation Systems (2025)
[102] Grislain, N.: Rag with differential privacy. In: Proc of the 2025 IEEE Conference on Artificial Intelligence (CAI), p. 847–852. IEEE Press, Piscataway, NJ (2025)
[103] Koga, T., Wu, R., Chaudhuri, K.: Privacy-preserving Retrieval-Augmented Generation with differential privacy. arXiv preprint arXiv:2412.04697. Accessed: 2025-11-11 (2024). http://arxiv.org/abs/2412.04697
[104] Wang, H., Xu, X., Huang, B., et al.: Privacy-aware decoding: Mitigating privacy leakage of large language models in Retrieval-Augmented Generation. arXiv preprint arXiv:2508.03098. Accessed: 2025-11-11 (2025). http://arxiv.org/abs/2508.03098
[105] Vinod, V., Pillutla, K., Thakurta, A.G.: INVISIBLEINK: High-utility and low-cost text generation with differential privacy. arXiv preprint arXiv:2507.02974. Accessed: 2025-11-11 (2025). http://arxiv.org/abs/2507.02974
[106] He, L., Tang, P., Zhang, Y., et al.: Mitigating privacy risks in retrieval-augmented generation via locally private entity perturbation. Information Processing & Management 62(4), 104150 (2025)
[107] Yao, D., Li, T.: Private retrieval augmented generation with random projection. In: Proc of the ICLR 2025 Workshop on Building Trust in Language Models and Applications. ICLR (2025)
[108] Khandelwal, U., Levy, O., Jurafsky, D., et al.: Generalization through memorization: Nearest neighbor language models.
arXiv preprint arXiv:1911.00172. Accessed: 2025-11-11 (2019). http://arxiv.org/abs/1911.00172
[109] Chakraborty, A., Dahal, C., Gupta, V.: Federated retrieval-augmented generation: a systematic mapping study. IEEE Transactions on Knowledge and Data Engineering 37(8) (2025)
[110] Addison, P., Nguyen, M.T.H., Medan, T., et al.: C-fedrag: A confidential federated retrieval-augmented generation system. arXiv preprint arXiv:2412.13163. Accessed: 2025-11-11 (2024). http://arxiv.org/abs/2412.13163
[111] Mao, Q.R., Zhang, Q.L., Hao, H.W., et al.: Privacy-preserving federated embedding learning for localized retrieval-augmented generation. JOURNAL OF LATEX CLASS FILES 14(8), 1–18 (2015)
[112] Shi, Z., Wan, G., Huang, W., et al.: Privacy-enhancing paradigms within federated multi-agent systems. arXiv preprint arXiv:2503.08175. Accessed: 2025-11-11 (2025). http://arxiv.org/abs/2503.08175
[113] Karamanlıoğlu, A., Demirel, B., Tural, O., et al.: Privacy-preserving clinical decision support for emergency triage using llms: System architecture and real-world evaluation. Applied Sciences 15(15), 8412 (2025)
[114] Zhang, X., Zhang, Y., Luo, G., et al.: Identity-based format preserving encryption of data desensitization program. In: 2020 International Conference on Computer Engineering and Application (ICCEA), p. 104–107 (2020)
[115] Jain, N., Schwarzschild, A., Wen, Y., et al.: Baseline Defenses for Adversarial Attacks Against Aligned Language Models (2023)
[116] Liang, X., Niu, S., Li, Z., Zhang, S., et al.: Saferag: Benchmarking security in retrieval-augmented generation of large language model. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, p. 4609–4631. Association for Computational Linguistics, Padua, Italy (2025)
[117] Avula, S., Zhang, R., Lee, C.-J., et al.: Measuring the fairness gap between retrieval and generation in rag systems using a cognitive complexity framework.
In: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 2994–2998. ACM, Padua, Italy (2025). https://doi.org/10.1145/3726302.3730230
[118] Zhang, B., Xin, H., Li, J., Zhang, D., Fang, M., Liu, Z., Nie, L., Liu, Z.: Benchmarking poisoning attacks against retrieval-augmented generation. arXiv preprint arXiv:2504.03957 (2025). https://arxiv.org/abs/2504.03957
[119] Sun, X., Xie, J., Chen, Z., Liu, Q., Wu, S., Chen, Y., Song, B., Wang, W., Wang, Z., Wang, L.: Divide-then-align: Honest alignment based on the knowledge boundary of rag. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 11461–11480. Association for Computational Linguistics, Online (2025). https://aclanthology.org/2025.acl-long.561
[120] She, Y., Peterson, D.W., Liu, M.M., Upadhyay, V., Chaghazardi, M.H., Kang, E., Roth, D.: RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts. OpenReview. arXiv:2510.05310v1 [cs.CL] (2025). https://openreview.net/attachment?id=LEfrWPOzZr&name=pdf
[121] Weller, O., Boratko, M., Naim, I., Lee, J.H.: On the theoretical limitations of embedding-based retrieval. In: Under Review as a Conference Paper at ICLR 2026 (Poster), OpenReview (2026). Supplementary material: PDF, GitHub repository with LIMIT dataset. https://openreview.net/forum?id=k9CzIvzfaA
[122] Arzanipour, A., Behnia, R., Ebrahimi, R., et al.: RAG Security and Privacy: Formalizing the Threat Model and Attack Surface (2025)
[123] Noorallahzadeh, M.H., Alimoradi, R., Gholami, A.: Searchable encryption taxonomy: Survey. Journal of Applied Security Research, 1–45 (2022)
[124] Bouzefrane, S.: Secure multi-agent retrieval-augmented generation: Enhancing cybersecurity in llm-driven agentic ai systems. Research paper, Sorbonne Université (March 2025).
https://w.sorbonne-universite.fr/sites/default/files/media/2025-03/BOUZEFRANE%20Samia PRD1.pdf
[125] Cloud Security Alliance (CSA): Agentic AI Red Teaming Guide. Cloud Security Alliance (2025). https://cloudsecurityalliance.org/artifacts/agentic-ai-red-teaming-guide
[126] Yuan, J., Nöther, J., Jaques, N., Radanović, G.: AgenticRed: Optimizing Agentic Systems for Automated Red-teaming. arXiv preprint arXiv:2601.13518v2 [cs.CR] (2026). https://arxiv.org/html/2601.13518v2
[127] Roychowdhury, S., Soman, S., Ranjani, H.G., et al.: Evaluation of RAG Metrics for Question Answering in the Telecom Domain (2024)
[128] Zhou, P., Feng, Y., Yang, Z.: Provably Secure Retrieval-Augmented Generation (2025)
[129] Chen, S., Li, Y., Zhang, Y., Wang, H., Liu, Z.: ZKPROV: A Zero-Knowledge Approach to Dataset Provenance for Large Language Models. arXiv preprint arXiv:2506.20915v1 [cs.CR] (2025). https://arxiv.org/html/2506.20915v1
[130] Ramakrishnan, B., Balaji, A.: Securing AI Agents Against Prompt Injection Attacks: A Comprehensive Benchmark and Defense Framework. arXiv preprint arXiv:2511.15759v1 [cs.CR] (2025). https://arxiv.org/html/2511.15759v1
[131] Castagnaro, A., Salviati, U., Conti, M., et al.: The Hidden Threat in Plain Text: Attacking RAG Data Loaders (2025)
[132] Fu, H., Ni, B., Xu, H., et al.: Ensemble Privacy Defense for Knowledge-Intensive LLMs against Membership Inference Attacks (2025)
[133] Cheng, Y., Zhang, L., Wang, J., et al.: RemoteRAG: A Privacy-Preserving LLM Cloud RAG Service (2024)
[134] Wang, B., Lou, Q., Zheng, M., et al.: PIR-RAG: A System for Private Information Retrieval in Retrieval-Augmented Generation (2025)
[135] Le, T., Behnia, R., Guajardo, J., et al.: Muses: Efficient multi-user searchable encrypted database. In: Proceedings of the 33rd USENIX Security Symposium, p. 2581–2598. USENIX Association, Philadelphia, PA, USA (2024).
https://w.usenix.org/conference/usenixsecurity24/presentation/le
[136] Yao, H., Shi, H., Chen, Y., et al.: ControlNET: A Firewall for RAG-based LLM System (2025)
[137] Reinhard, P., Li, M.M., Fina, M., et al.: Fact or fiction? Exploring explanations to identify factual confabulations in rag-based llm systems. In: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25), p. 1–13. ACM, Yokohama, Japan (2025). https://doi.org/10.1145/3706599.3720249
[138] Mao, Y., Dong, X., Xu, W., et al.: Fit-rag: Black-box rag with factual information and token reduction. ACM Transactions on Information Systems 43(2), 1–27 (2025)
[139] Bhowmick, A., Singh, K., S, R., et al.: Raguru: A tool to create and automatically deploy workload optimized rag. In: Companion of the 16th ACM/SPEC International Conference on Performance Engineering (ICPE Companion '25), p. 109–113. ACM, Toronto, ON, Canada (2025). https://doi.org/10.1145/3680256.3721326
[140] Cuconasu, F., Trappolini, G., Siciliano, F., et al.: The power of noise: Redefining retrieval for rag systems. arXiv preprint (2024)
[141] Jung, J., Jeong, H., Huh, E.-N.: Federated Learning and RAG Integration: A Scalable Approach for Medical Large Language Models (2024)
[142] Liang, H., Zhou, Y., Gurbani, V.K.: Efficient and verifiable responses using retrieval augmented generation (rag). In: Proceedings of the 4th International Conference on AI-ML Systems (AIMLSystems 2024), p. 1–6. ACM, Baton Rouge, LA, USA (2024). https://doi.org/10.1145/3703412.3703431