Paper deep dive
Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage
Saron Samuel, Alexander Martin, Eugene Yang, Andrew Yates, Dawn Lawrie, Ian Soboroff, Laura Dietz, Benjamin Van Durme
Abstract
Retrieval-augmented generation (RAG) systems combine document retrieval with a generative model to address complex information seeking tasks like report generation. While the relationship between retrieval quality and generation effectiveness seems intuitive, it has not been systematically studied. We investigate whether upstream retrieval metrics can serve as reliable early indicators of the final generated response's information coverage. Through experiments across two text RAG benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), we analyze 15 text retrieval stacks and 10 multimodal retrieval stacks across four RAG pipelines and multiple evaluation frameworks (Auto-ARGUE and MiRAGE). Our findings demonstrate strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both topic and system levels. This relationship holds most strongly when retrieval objectives align with generation goals, though more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness. These findings provide empirical support for using retrieval metrics as proxies for RAG performance.
Tags
Links
- Source: https://arxiv.org/abs/2603.08819v2
- Canonical: https://arxiv.org/abs/2603.08819v2
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/13/2026, 12:55:35 AM
Summary
This paper investigates the relationship between upstream retrieval quality and downstream RAG information coverage. Through experiments across text (TREC NeuCLIR 2024, TREC RAG 2024) and multimodal (WikiVideo) benchmarks, the authors demonstrate that coverage-based retrieval metrics are reliable proxies for RAG performance. They find strong correlations at both topic and system levels, though complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness.
Entities (6)
Relation Signals (3)
Retrieval Metrics → indicates → RAG Information Coverage
confidence 95% · Our findings demonstrate strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses
Auto-ARGUE → evaluates → RAG
confidence 90% · Auto-ARGUE [56] applies ARGUE [39] nugget evaluation to RAG systems
Iterative RAG Pipelines → decouples → Retrieval Effectiveness
confidence 85% · more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness
Cypher Suggestions (2)
Identify relationships between retrieval metrics and RAG performance · confidence 95% · unvalidated
MATCH (s:Entity)-[r:INDICATES]->(o:Entity) WHERE s.name CONTAINS 'Retrieval' RETURN s.name, r.relation, o.name
Find all benchmarks used in the study · confidence 90% · unvalidated
MATCH (e:Entity {entity_type: 'Benchmark'}) RETURN e.name
Full Text
69,638 characters extracted from source content.
Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

Saron Samuel, Alexander Martin, Benjamin Van Durme (Johns Hopkins University, Baltimore, MD, USA); Eugene Yang, Andrew Yates, Dawn Lawrie (Johns Hopkins HLTCOE, Baltimore, MD, USA); Ian Soboroff (National Institute of Standards and Technology, Gaithersburg, MD, USA); Laura Dietz (University of New Hampshire, Durham, NH, USA)

Abstract

Retrieval-augmented generation (RAG) systems combine document retrieval with a generative model to address complex information-seeking tasks like report generation. While the relationship between retrieval quality and generation effectiveness seems intuitive, it has not been systematically studied. We investigate whether upstream retrieval metrics can serve as reliable early indicators of the final generated response's information coverage. Through experiments across two text RAG benchmarks (TREC NeuCLIR 2024 and TREC RAG 2024) and one multimodal benchmark (WikiVideo), we analyze 15 text retrieval stacks and 10 multimodal retrieval stacks across four RAG pipelines and multiple evaluation frameworks (Auto-ARGUE and MiRAGE). Our findings demonstrate strong correlations between coverage-based retrieval metrics and nugget coverage in generated responses at both topic and system levels. This relationship holds most strongly when retrieval objectives align with generation goals, though more complex iterative RAG pipelines can partially decouple generation quality from retrieval effectiveness. These findings provide empirical support for using retrieval metrics as proxies for RAG performance.

CCS Concepts

• Information systems → Information retrieval diversity; Summarization; • Computing methodologies → Natural language generation.
Keywords

Retrieval-augmented Generation, Search Result Diversification, Generation, Retrieval, Correlation, Information Coverage

1 Introduction

As information systems have moved from merely providing source information for human consumption to providing a synthesized text that summarizes the source information, system architecture has changed dramatically. When the system outputs a ranked list or a search engine result page (SERP), gathering documents that are most likely to be relevant to the user query has been the typical design goal. The community generally refers to this class of problems as adhoc retrieval.

However, when users expect only a piece of text synthesized from the source information, a succinct and well-organized summary is usually preferable to a simple concatenation of relevant documents [66]. Similar and related information is expected to be grouped, rephrased, or merged [6]; multiple facets should be included if they are all relevant to the user query [41]. This preference implies that source documents providing the same set of relevant information, while still equally relevant to the query, are not as useful after the first one has been processed by the downstream summarization model, usually powered by a large language model (LLM) [3,11,16,17,63]. Therefore, the objective of the information system has changed: it should provide a summary of information covering multiple aspects or facets without redundancy. This kind of retrieval problem is generally referred to as report generation [33].

These systems are usually implemented with retrieval-augmented generation (RAG), which combines the strengths of document retrieval with LLM generation to address complex information-seeking tasks. Adhoc retrieval, in this case, becomes an upstream component in the system by providing source information to the generation model.
However, given the final objective of report generation, it should be sufficient for the upstream retrieval model to provide a relatively small number of documents that cover all aspects needed in the final generation. To date, LLMs are not able to faithfully leverage their full advertised context windows, exhibiting information loss [34] and hallucination [26] beyond a certain effective limit. In fact, as long as the LLM has an effective context window, there is a need to filter the collection to a smaller subset of documents for the downstream model. Therefore, minimizing the number of documents the generation model needs to process should benefit the final generated responses. The problem of retrieving for multiple aspects while penalizing redundant information has been studied in prior retrieval literature on diversity ranking [8], which is equivalent to information coverage [67].

However, the relationship between the information coverage of the upstream retrieval and the final generation quality, though logical, has not been systematically studied. Furthermore, end-to-end evaluation of generated responses requires running complete RAG pipelines, which incurs substantial computational cost. Unlike document-level judgments, judgments of generated responses are not easily reusable, so evaluating them incurs substantial further costs in the form of collecting new judgments from a human or an LLM. Moreover, the LLM itself adds variability and noise, as different LLMs and generation strategies can produce divergent outputs even when given identical retrieved context, resulting in noisy signals when attributing effectiveness to each component in the pipeline [4,7].

In this paper, we aim to provide a systematic study of the relationship between upstream retrieval and downstream generation quality.
In particular, we focus on information coverage (usually materialized as nugget coverage in recent literature [1,46]) of the final generated response, since it is the primary purpose of employing a retrieval model in the generation pipeline. Other qualities, such as fluency and faithfulness, can already be attributed to the LLM itself [40] and are thus excluded from this study. Specifically, we aim to answer the question: "Is the upstream retrieval quality an early indicator of the information coverage of the downstream generation responses?" With our empirical evidence, we show that there is a strong relationship between the two stages. This evidence provides empirical grounds for simplifying the evaluation of information coverage to focus on the upstream retrieval model, reducing both computational costs and experimental noise. Such simplifications are common in the literature, such as focusing on recall for first-stage retrieval models [24] and on precision at one for question answering and conversational search (where answers need to be announced to users through audio).

Our analysis operates at multiple levels. Topic-level analysis examines whether better retrieval results (a ranked list) for a specific query lead to a better generated response. System-level analysis assesses whether employing a more effective retrieval model yields a more effective RAG pipeline in general, measured by information coverage. For robustness, we examine these relationships across different generation pipelines, from simple retrieve-then-generate approaches [15,30,63] to more complex multi-query and iterative strategies [2,9], to understand whether generation complexity can compensate for weaker retrieval.

Our contributions are:

• We demonstrate that nugget-oriented retrieval metrics serve as reliable indicators of RAG information coverage, with strong correlations observed at both topic and system levels across text and multimodal benchmarks.
• We show that RAG pipeline complexity affects the retrieval-generation relationship. Simpler linear pipelines benefit directly from retrieval improvements, while complex iterative pipelines can partially decouple generation quality from retrieval effectiveness by adapting queries to retrieval system capabilities.
• We validate our findings across multiple generation strategies (GPT-Researcher, Bullet List, LangGraph), evaluation frameworks (Auto-ARGUE, MiRAGE), and modalities (text and video), demonstrating the robustness and generalizability of coverage-based retrieval metrics as proxies for RAG performance.

2 Background

2.1 Retrieval and Evaluation

Traditional adhoc retrieval evaluation has focused on document relevance. Metrics such as MRR, MAP, and nDCG measure the quality of a ranking based on a model's ability to place relevant documents at the top under different browsing models [25,49]. However, retrieval models embedded in a RAG pipeline for complex information-seeking tasks need to retrieve a broad coverage of information to satisfy the complexity of the information need. Therefore, coverage metrics, such as Sub-topic Recall [12] and α-nDCG [13], become more appropriate, as they evaluate how well a retrieval system gathers information that collectively addresses all aspects of the user's information need [10]. α-nDCG [13] incorporates both relevance and diversity, penalizing redundant information while rewarding novel relevant content. Sub-topic recall explicitly measures whether retrieval results cover multiple facets of the information needed [12]. These coverage metrics align with the goals of RAG systems, since generation requires comprehensive information to produce complete and accurate responses. These metrics were developed for retrieval diversification tasks, such as the TREC Interactive Track [44] and the TREC Novelty Track [53].
These tasks generally aim to provide a diverse set of retrieval results covering different potential user intents given a short, ambiguous query. The TREC Complex Answer Retrieval Track [42] needs diversity for answering a simple question with a complex answer, covering multiple facets of the question. While methods from these efforts are certainly applicable to the report generation task, the goal is improving coverage of the information that the user would like to receive rather than simply covering results as broadly as possible. Therefore, while metrics from the diversification literature are useful, we view diversification as one of the approaches to improving information coverage.

2.2 RAG Pipelines

To synthesize a generated response that satisfies complex information needs, RAG systems usually consist of two primary components: an upstream retrieval component that identifies and ranks relevant documents from a corpus, and a downstream generation component that synthesizes information from retrieved documents to produce coherent responses. While this basic architecture is consistent across RAG systems, the strategies for combining retrieval and generation vary significantly in complexity [22].

2.2.1 Linear RAG Pipelines. The simplest RAG architecture follows a retrieve-then-generate pattern, where the system retrieves documents once based on the user's query and then generates a response from the retrieved context. Crucible [15] and GINGER [30] are two examples, where relevant information is extracted from the top-k documents of a diverse ranking. After retrieval, such systems follow a multi-step response generation stage over the retrieved content via nugget extraction and detection, clustering, ranking, summarization, and fluency enhancement. A natural extension introduces sub-query generation, where the system decomposes complex queries into multiple sub-queries to retrieve more diverse information.
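The retrieve-then-generate pattern, optionally extended with sub-queries, can be sketched as follows. This is a minimal illustration, not any system's actual implementation: `retrieve` and `generate` are toy stand-ins for a real retrieval model and an LLM.

```python
def retrieve(query, corpus, k=3):
    """Toy first-stage retriever: score documents by term overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(d.lower().split())))
    return ranked[:k]

def generate(query, docs):
    """Stand-in for an LLM call that synthesizes a response from retrieved context."""
    return f"Response to {query!r} grounded in {len(docs)} documents."

def linear_rag(query, corpus, subqueries=(), k=3):
    """Retrieve once, pooling the query and any sub-queries, then generate."""
    context = []
    for q in [query, *subqueries]:
        for doc in retrieve(q, corpus, k):
            if doc not in context:  # deduplicate while preserving rank order
                context.append(doc)
    return generate(query, context)
```

A sub-query variant would obtain `subqueries` from an LLM decomposition of the original query before calling `linear_rag`; without sub-queries the loop reduces to a single retrieval pass.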
GPT-Researcher (GPT-R) [17,18] and Bullet List (the extractive approach in Yang et al. [63]) generate multiple queries to gather information from different perspectives before generation. While sub-query generation could be seen as part of the retrieval stack, similar to query expansion [32], query rewriting [9], and pseudo-relevance feedback in traditional retrieval pipelines, in this paper we treat it as part of the generation pipeline, as it typically involves LLM decomposition that is tightly coupled with the generation process.

2.2.2 Iterative RAG Pipelines. More sophisticated RAG systems employ iterative strategies that alternate between retrieval and generation multiple times, such as Self-RAG, IM-RAG, and SIM-RAG [3,19,23,28,60,61]. After an initial retrieval step, the system analyzes retrieved documents to identify information gaps, then performs additional retrieval iterations to fill those gaps. This iterative process continues until the system determines it has sufficient information to generate a comprehensive response. This iterative retrieval could also be viewed as part of the retrieval stack, similar to relevance feedback mechanisms in classical IR [32]. But the integration with generation logic and the LLM decision making about when and what to retrieve in each iteration leads us to consider it part of the generation pipeline for our analysis.

2.3 RAG Evaluation

RAG evaluation presents unique challenges because it assesses generated responses. In the context of tasks like report generation where document citations are required, the evaluation must assess both whether the generated response accurately addresses the user's information needs and whether it reflects and cites information in the retrieved documents. Various evaluation frameworks have emerged to address these challenges, such as ARGUE [39], AutoNuggetizer [46], EXAM [20], MiRAGE [38], and RUBRIC [21].
Nugget-based evaluation provides a framework for measuring information coverage. Human assessors identify atomic units of information (nuggets) that constitute a complete answer, then evaluate generated responses based on how many nuggets they contain. Auto-ARGUE [56] applies ARGUE [39] nugget evaluation to RAG systems by representing nuggets as question-answer pairs. Given a query and a set of retrieved documents, assessors create QA pairs that capture distinct pieces of relevant information. Generated responses are then evaluated by checking which QA pairs they answer correctly. This approach provides fine-grained coverage measurement without decomposing the generation into subclaims.

MiRAGE [38] takes a different approach by representing nuggets as claims that can be decomposed into subclaims. The framework evaluates both whether subclaims are attested in the generated response (recall) and whether attested claims are properly cited with correct sources. MiRAGE's citation recall metric aligns with our goals, since a nugget should both appear in the generated response and be grounded in the correct source document.

We use Auto-ARGUE [56] and MiRAGE [38] to investigate whether retrieval systems that gather more comprehensive information lead to generated responses with higher information coverage. We consider the relationship between the information coverage of retrieved documents and of generated responses at both the topic level (i.e., whether improving retrieval coverage on a specific topic leads to better generation coverage on that topic) and the system level (i.e., whether retrieval systems that have better coverage on average lead to generated responses that have better coverage on average).

3 Research Questions

To investigate the relationship between retrieval and RAG information coverage, we formulate the following research questions to guide the exploration.
RQ1: For a given RAG pipeline, does an input ranked list that covers more information lead to a more effective generated response for a given topic?
To address this question, we investigate the correlation between retrieval effectiveness and the generated response at the topic level (i.e., correlating the per-topic scores instead of first averaging over topics) over various retrieval stacks. This analysis provides insight into how retrieval effectiveness on a specific topic directly impacts the downstream generated response's information coverage.

RQ2: For a given RAG pipeline, does using a more effective retrieval system as a component lead to a more effective RAG system?
Practically, when building a RAG pipeline, practitioners need to pick a retrieval system as a pipeline component. This system should be effective for all (or on average over) user queries. Therefore, we investigate the correlation between a retrieval system's average evaluation metric values and the generated response's average nugget coverage. We investigate this relationship both within and across domains, as well as on the same and distinct retrieval task objectives. Such experiments allow us to simulate a practical system design process where the retrieval system's effectiveness cannot be measured on the topics that users would issue to the RAG system.

RQ3: Can a more complex RAG pipeline compensate for a less effective retrieval system?
RAG pipelines vary in complexity, ranging from receiving a fixed ranked list of documents from the retrieval model based on the user's query to iteratively reflecting and generating queries to issue to the retrieval model. We investigate whether the relationships found in RQ1 and RQ2 still apply to more complex RAG pipelines.

RQ4: Do these relationships hold across different RAG evaluators?
Given that generation tasks like report generation typically do not have a single correct answer, RAG evaluation requires automatic evaluation systems to score the generated responses. We further investigate the robustness of our findings on the previous research questions across evaluation approaches.

RQ5: Do these relationships hold in multimodal RAG?
Finally, we investigate our findings on a multimodal report generation task for further evidence of the generalizability of our findings. Since multimodal and video RAG experiments are much more expensive to conduct, we consider a single, representative generation system in our experiments.

4 Experiment Setup

To systematically investigate the five research questions, we conduct our experiments across two text generation tasks and one multimodal generation task to support both within-dataset and cross-dataset analysis.

4.1 Evaluation Datasets

For end-to-end RAG evaluation, we use the TREC NeuCLIR 2024 Report Generation Pilot Task (NeuCLIR24) [33] and TREC RAG 2024 (RAG24) [46,54,55] as our two text RAG tasks because of their nugget annotations, which enable information coverage evaluation. NeuCLIR24 is a multilingual report generation task where the user input is a problem statement that expresses a rich information need in a paragraph of text, along with a background that describes the profile of the user. NeuCLIR24 has 19 topics assessed for the report generation pilot. The document collection contains more than 10 million news articles extracted from Common Crawl in Chinese, Persian, and Russian. RAG24 is a question-answering task that requires the system to retrieve multiple documents as supporting evidence for the final answer. RAG24 uses MS MARCO Segment v2.1 [45] as the document collection, with 55 judged queries. These questions, for example, "why are trade-offs so important to the success of a business?", while not as simple as factual questions, can often be answered by a single document.
However, there are multiple documents containing acceptable answers that can be retrieved. Additionally, we use WikiVideo [37], an event-centric article writing task using videos as the supporting documents, to further verify our findings with multimodal RAG. WikiVideo contains 109K videos and 57 topics, with an average of 8 relevant videos per topic (maximum 10 videos).

We refer to the user input for all RAG tasks as queries, and acknowledge the nuanced differences between the kinds of user input in each task. To evaluate the retrieval models, we use the nuggets in each collection to form nugget-based qrels that treat each nugget in a topic as an aspect or group. This format is supported by various retrieval evaluation toolkits, including trec-eval, ndeval, and ir-measures [36]. Additionally, we include MultiVent 2.0 [29], containing the superset of the WikiVideo videos, as an additional retrieval task for analysis.

4.2 Retrieval Systems

We include retrieval systems that combine different first-stage retrieval and reranking methods. For consistency, when reranking, we always rerank the top 100 documents from the first-stage model. These retrieval stacks are served using RoutIR [64], a serving package that wraps retrieval models with pipeline APIs to support interactions between retrieval and generation systems.

For NeuCLIR24 and RAG24, we use the following first-stage retrieval models, covering common first-stage retrieval architectures:

• BM25 [50] as our lexical retrieval baseline.
• PLAID-X [62], a strong multilingual late-interaction model that achieves the best first-stage effectiveness in TREC NeuCLIR 2023 and 2024.
• LSR. We use MILCO [43], a multilingual learned sparse retrieval (LSR) model that projects text into a shared English lexical space, for NeuCLIR24, and SPLADEv3 [31], a strong English LSR model, for RAG24.
• Qwen3-8B Embed [59,65], a multilingual dense embedding model built on Qwen3.
• 3-way RRF, which combines results from PLAID-X, LSR, and Qwen3-8B Embed with Reciprocal Rank Fusion [14].

To further improve retrieval effectiveness, we employ two rerankers to rerank the first-stage retrieval results from each method:

• Qwen3-8B Reranker [65], a multilingual pointwise reranking model built on Qwen3 [59], with state-of-the-art effectiveness in various retrieval tasks.
• Rank1-7B [57], a pointwise reasoning reranking model, based on Qwen2.5, that allows test-time scaling.

For WikiVideo, we use the following multimodal retrieval models as first-stage retrieval models that leverage different modalities:

• CLIP [47], a pretrained text-image model. To retrieve videos, we take 16 frames and use the maximum similarity over the query-frame pairs.
• LanguageBind [68], a multimodal framework that uses language to embed different modalities in a shared space.
• MMMORRF [51], a video retrieval system that combines visual, audio, and text features with modality-aware weighted reciprocal rank fusion.
• Video-ColBERT [48], a late-interaction text-to-video model.
• OmniEmbed [35], an omnimodal dense encoder built on Qwen2.5 Omni [58].

For WikiVideo, we employ RankVideo [52], a reasoning reranker, for pointwise reranking in text-to-video retrieval.

Overall, we investigate 15 retrieval stacks (5 first-stage models, with and without each of the 2 rerankers) for the NeuCLIR24 and RAG24 tasks; for WikiVideo, we use 10 stacks.

4.3 RAG Pipelines

We use four RAG pipelines for the NeuCLIR24 and RAG24 tasks: GPT-Researcher (GPT-R) [17,18] with one and three queries, Bullet List [63], and LangGraph [16]. All generation pipelines are given a processing budget of 50 documents to enable a fair comparison between pipelines. All text RAG pipelines use Llama-70B-Instruct as the backbone LLM.
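The 3-way RRF stack in Section 4.2 fuses ranked lists with Reciprocal Rank Fusion. A minimal sketch of the fusion step (the constant k=60 is the commonly used default from the RRF literature, not a value this paper specifies):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids: each list contributes 1 / (k + rank)
    to a document's score, so items ranked highly anywhere float to the top."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

In the paper's setting, the three input rankings would come from PLAID-X, the LSR model, and Qwen3-8B Embed for the same query.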
GPT-Researcher (GPT-R) [17,18] is used as a cascade system that retrieves a seed set of documents given the query, generates additional sub-queries, retrieves documents with the original query and the sub-queries, and then aggregates the information into reports. We experiment with one query (i.e., no sub-query generation) and three queries (i.e., generating two sub-queries) with this pipeline.

Bullet List [63] is an extractive system that first generates 10 Google-like queries, retrieves the top 5 documents per query, extracts key related facts from each document, and finally groups the facts with an LLM. The pipeline is implemented with DSPy [27].

Finally, we implement an iterative system with LangGraph [16] that iteratively generates sub-queries, retrieves, and compresses documents. It uses reflection to identify knowledge gaps and trigger additional retrieval loops, then drafts and revises responses. Since this pipeline is computationally expensive, we only experiment with NeuCLIR24.

For WikiVideo, we use CAG [37] with a Qwen2.5-VL-72B backbone [5]. CAG takes a query and the top 10 videos from the retrieval stack as input, first extracts key video information related to the query from each video, and then aggregates it into a single response, much like Bullet List but without sub-query generation.

4.4 Retrieval Evaluation

To study the relationship between retrieval and RAG, we independently evaluate retrieval models outside of the RAG pipelines to assess system effectiveness. We evaluate with both coverage-based and relevance-based metrics. To measure information coverage, we derive nugget qrels from the nuggets in each RAG collection. Specifically for RAG24, since the nugget-document alignment (i.e., the mapping of nuggets to the documents containing them) was never recorded (nor completely assessed), we use the same Llama 70B model to judge each relevant document for each nugget in the topic.
When assuming a document is relevant if it contains a nugget, our LLM judge achieves a precision of 69% and a recall of 90% across all 55 topics compared to the RAG24 relevance judgments. Given that the relevance assessment in RAG24 does not require a document to be considered relevant based on the nuggets residing in it, we believe this accuracy is reasonable, and the LLM-judged nugget-document alignment can be used to evaluate retrieval effectiveness for the purposes of this study.

We report three coverage metrics: α-nDCG, nDCG using the nugget-based qrels, and Subtopic Recall (StRecall). We use a rank cutoff of 20 for NeuCLIR24 and RAG24, whereas we use a cutoff of 10 for WikiVideo, since there are at most 10 relevant videos for each query by design.

The key differences between these three metrics are the penalty for retrieving redundant nuggets (i.e., nuggets that were already present in another document ranked higher) and how redundancy and document positions affect the score. α-nDCG discounts the gain by a document's rank and reduces the gain when a nugget has already been covered by an earlier document. nDCG uses the nugget-based qrels without any consideration of whether a nugget was already covered, which means that only documents containing a nugget contribute to the score, and there is no penalty for nugget redundancy. StRecall is a set measure without any penalty on the ranking, measuring the fraction of nuggets covered.

Additionally, we report relevance metrics using nDCG with the same cutoffs on all datasets. We use the qrels that contain the relevance assessments by TREC assessors for relevance-based evaluation on NeuCLIR24 and RAG24, which means that documents labeled relevant do not necessarily contain a nugget. In the tables, where label sources are not explicitly mentioned for space reasons, we use (R) to indicate that the nDCG metric uses the relevance-based qrels instead of the nugget-based ones.
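Given a nugget-to-document mapping, the two coverage metrics that differ most, StRecall and α-nDCG, can be sketched as below. This is a minimal illustration of the definitions above, not the paper's evaluation code (which uses trec-eval, ndeval, and ir-measures); full α-nDCG would further normalize by an ideal ranking, typically computed greedily, which is omitted here for brevity.

```python
import math

def st_recall(ranking, doc_nuggets, all_nuggets, cutoff=20):
    """StRecall: fraction of the topic's nuggets covered by the top-cutoff documents."""
    covered = set()
    for doc in ranking[:cutoff]:
        covered |= doc_nuggets.get(doc, set())
    return len(covered & all_nuggets) / len(all_nuggets)

def alpha_dcg(ranking, doc_nuggets, alpha=0.5, cutoff=20):
    """alpha-DCG: gain is rank-discounted, and each repeat of an
    already-seen nugget is damped by a factor of (1 - alpha)."""
    seen, score = {}, 0.0
    for i, doc in enumerate(ranking[:cutoff], start=1):
        nuggets = doc_nuggets.get(doc, set())
        gain = sum((1 - alpha) ** seen.get(n, 0) for n in nuggets)
        for n in nuggets:
            seen[n] = seen.get(n, 0) + 1
        score += gain / math.log2(i + 1)
    return score
```

The contrast described in the text is visible in the code: `st_recall` ignores both rank and redundancy, while `alpha_dcg` discounts by rank and shrinks the gain of a nugget every time it reappears.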
4.5 RAG Evaluation

For NeuCLIR24 and RAG24, we use Auto-ARGUE [56] as the primary evaluator for the RAG responses. It takes nuggets in the form of question-answer pairs, which come natively with the NeuCLIR24 dataset. For RAG24, where the nuggets are in the form of claims, we again used Llama-70B-Instruct to separate them into questions and answers. (Since RAG24 did not publish an overview paper, we instead directly communicated with TREC organizers to verify; we will release the prompt and the judged labels upon publication.) With human verification, the nuggets are reasonable and faithfully represent the information in the original claims. We evaluate generated responses using Nugget Coverage, the proportion of grounded nuggets covered in the generated response. A nugget is grounded if it is accompanied by a citation to a document containing the nugget.

For WikiVideo, we use MiRAGE [38], its official evaluator. MiRAGE is a nugget-based evaluator for multimodal RAG, measuring factuality, information coverage, and citation support. We report Information Precision (InfoP), evaluating factuality, and Information Recall (InfoR), evaluating coverage.

Note that the definitions of InfoR in MiRAGE and Nugget Coverage in Auto-ARGUE are slightly different. InfoR directly assesses whether a nugget is covered in the response, whereas Nugget Coverage rewards only nuggets in the response that carry an appropriate citation. This slight difference introduces some nuanced analysis, which we discuss in the next section.

To measure the relationship between metrics, we use the Pearson correlation coefficient.
Since we would like to understand whether improvement in retrieval effectiveness is an indicator of improvement in the final RAG responses, we use Pearson correlation instead of rank correlation to directly capture the relationship between metric values rather than between system rankings, which is subject to the choice of and differences between the systems.

5 Analysis

Tables 1 and 2 summarize the effectiveness of the models included in this study on both the TREC NeuCLIR 2024 Report Generation Pilot Task (NeuCLIR24) and TREC RAG 2024 (RAG24). The 15 retrieval stacks cover a wide spectrum of retrieval quality, measured by four different metrics, from BM25, generally the least effective on all metrics, to the Qwen3-8B and Rank1-7B rerankers, the most effective on NeuCLIR24 and RAG24, respectively. These retrieval models provide the basis for our studies. Similarly, the four RAG pipelines using all 15 retrieval stacks also exhibit varying coverage of nuggets. While the absolute value range is smaller than for retrieval, these systems are representative of common RAG pipelines in the literature, where each represents a particular style of design with its own advantages. Here, Bullet List generally provides the highest nugget coverage for a given retrieval stack in both NeuCLIR24 and RAG24, while GPT-R with one and three queries varies in its ability to cover nuggets in the generated response with different retrieval stacks. This clearly demonstrates the noisy nature of evaluating RAG pipelines when the focus is informational rather than the ability of the underlying language model. Nevertheless, these two tables summarize the underlying data that we use for the relationship analysis in this work.

5.1 Topic-Level Analysis

With all retrieval and generation combinations, we first calculate the topic-level correlation on each benchmark.
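The topic-level correlation just described pairs, for each topic, a retrieval score with the nugget coverage of the resulting response, and computes Pearson's r over those pairs. A minimal sketch with made-up per-topic numbers (the topic ids and values are illustrative, not from the paper):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-topic scores for one retrieval stack + RAG pipeline:
retrieval_score = {"t1": 0.70, "t2": 0.45, "t3": 0.60}   # e.g. alpha-nDCG@20
nugget_coverage = {"t1": 0.55, "t2": 0.40, "t3": 0.50}   # e.g. Auto-ARGUE coverage
topics = sorted(retrieval_score)
r = pearson([retrieval_score[t] for t in topics],
            [nugget_coverage[t] for t in topics])
```

System-level correlation works the same way, except each pair is a system's topic-averaged retrieval score and topic-averaged nugget coverage.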
² Prompts and the resulting question-answer nuggets will be released upon publication.

Table 1: Retrieval evaluation on NeuCLIR24 and RAG24, using nugget-based qrels. Columns marked with (R) indicate that the qrels for that metric are the relevance-based qrels instead of the nugget-based ones. All metrics in this table use a rank cutoff of 20.

| First Stage | Reranker | NeuCLIR24 α-nDCG | NeuCLIR24 nDCG | NeuCLIR24 StRecall | NeuCLIR24 (R) nDCG | RAG24 α-nDCG | RAG24 nDCG | RAG24 StRecall | RAG24 (R) nDCG |
|---|---|---|---|---|---|---|---|---|---|
| BM25 | – | 0.349 | 0.328 | 0.545 | 0.170 | 0.450 | 0.207 | 0.708 | 0.263 |
| BM25 | Qwen3-8B | 0.583 | 0.581 | 0.684 | 0.346 | 0.633 | 0.318 | 0.819 | 0.410 |
| BM25 | Rank1-7B | 0.510 | 0.520 | 0.629 | 0.293 | 0.646 | 0.343 | 0.804 | 0.369 |
| PLAID-X | – | 0.446 | 0.612 | 0.637 | 0.295 | 0.709 | 0.364 | 0.919 | 0.493 |
| PLAID-X | Qwen3-8B | 0.634 | 0.788 | 0.784 | 0.416 | 0.718 | 0.385 | 0.903 | 0.606 |
| PLAID-X | Rank1-7B | 0.586 | 0.762 | 0.763 | 0.411 | 0.730 | 0.413 | 0.904 | 0.542 |
| LSR | – | 0.658 | 0.766 | 0.853 | 0.351 | 0.667 | 0.317 | 0.874 | 0.473 |
| LSR | Qwen3-8B | 0.720 | 0.892 | 0.874 | 0.496 | 0.697 | 0.365 | 0.883 | 0.578 |
| LSR | Rank1-7B | 0.651 | 0.877 | 0.848 | 0.458 | 0.733 | 0.412 | 0.908 | 0.537 |
| Qwen-8B Embed | – | 0.627 | 0.819 | 0.836 | 0.390 | 0.683 | 0.401 | 0.899 | 0.547 |
| Qwen-8B Embed | Qwen3-8B | 0.691 | 0.860 | 0.839 | 0.468 | 0.705 | 0.394 | 0.903 | 0.605 |
| Qwen-8B Embed | Rank1-7B | 0.613 | 0.820 | 0.786 | 0.387 | 0.742 | 0.454 | 0.935 | 0.548 |
| 3 Way RRF | – | 0.637 | 0.796 | 0.851 | 0.374 | 0.705 | 0.366 | 0.899 | 0.587 |
| 3 Way RRF | Qwen3-8B | 0.702 | 0.870 | 0.852 | 0.482 | 0.709 | 0.382 | 0.906 | 0.623 |
| 3 Way RRF | Rank1-7B | 0.653 | 0.870 | 0.840 | 0.430 | 0.739 | 0.432 | 0.896 | 0.609 |

Table 2: Nugget Coverage of RAG responses on NeuCLIR24 and RAG24 using Auto-ARGUE.

| First Stage | Reranker | NeuCLIR24 GPT-R (1) | NeuCLIR24 GPT-R (3) | NeuCLIR24 Bullet List | NeuCLIR24 LangGraph | RAG24 GPT-R (1) | RAG24 GPT-R (3) | RAG24 Bullet List |
|---|---|---|---|---|---|---|---|---|
| BM25 | – | 0.449 | 0.527 | 0.570 | 0.490 | 0.496 | 0.512 | 0.602 |
| BM25 | Qwen3-8B | 0.541 | 0.509 | 0.602 | 0.559 | 0.510 | 0.263 | 0.618 |
| BM25 | Rank1-7B | 0.484 | 0.561 | 0.601 | 0.574 | 0.512 | 0.548 | 0.559 |
| PLAID-X | – | 0.504 | 0.523 | 0.586 | 0.545 | 0.599 | 0.588 | 0.647 |
| PLAID-X | Qwen3-8B | 0.583 | 0.562 | 0.607 | 0.545 | 0.585 | 0.597 | 0.652 |
| PLAID-X | Rank1-7B | 0.588 | 0.588 | 0.581 | 0.552 | 0.589 | 0.551 | 0.623 |
| LSR | – | 0.518 | 0.568 | 0.639 | 0.552 | 0.568 | 0.589 | 0.654 |
| LSR | Qwen3-8B | 0.568 | 0.547 | 0.599 | 0.569 | 0.519 | 0.571 | 0.664 |
| LSR | Rank1-7B | 0.579 | 0.599 | 0.584 | 0.534 | 0.587 | 0.586 | 0.635 |
| Qwen-8B Embed | – | 0.587 | 0.582 | 0.608 | 0.560 | 0.592 | 0.603 | 0.662 |
| Qwen-8B Embed | Qwen3-8B | 0.571 | 0.546 | 0.583 | 0.532 | 0.562 | 0.579 | 0.658 |
| Qwen-8B Embed | Rank1-7B | 0.587 | 0.585 | 0.604 | 0.520 | 0.557 | 0.594 | 0.652 |
| 3 Way RRF | – | 0.600 | 0.575 | 0.606 | 0.547 | 0.608 | 0.611 | 0.662 |
| 3 Way RRF | Qwen3-8B | 0.611 | 0.599 | 0.597 | 0.546 | 0.582 | 0.620 | 0.662 |
| 3 Way RRF | Rank1-7B | 0.582 | 0.550 | 0.610 | 0.490 | 0.570 | 0.579 | 0.649 |

Table 3 summarizes the Pearson correlation coefficients between nugget coverage and retrieval effectiveness using the respective retrieval metric. Regardless of the generation pipeline, all generated responses correlate most strongly with α-nDCG using nugget-based labels (with the exception of Bullet List on NeuCLIR24, where correlation is equally high across all nugget-based metrics), indicating that a RAG pipeline integrating a retrieval system capable of retrieving wide information coverage is likely to produce a response that also has high information coverage. Nugget-based nDCG@20 also provides a strong indicator. However, since it does not penalize redundancy among retrieved documents carrying duplicated information, it may overestimate how useful a ranked document is to the downstream generation model, leading to lower correlation with nugget coverage. Interestingly, StRecall, a set-based metric, generally correlates more strongly with nugget coverage than nDCG@20. While it also does not explicitly penalize redundant documents, it directly evaluates the set of documents, similar to how the downstream generation model consumes the retrieval results.
Therefore, StRecall aligns with how retrieval results are used in the pipeline. Since each generation pipeline consumes the retrieval results differently, but always takes documents from the top of the ranking, α-nDCG provides the most robust evaluation across downstream pipelines. These three metrics all evaluate the retrieval model on the same information objective: coverage.

We can also compare the correlation of RAG nugget coverage with the relevance-based metric that evaluates the retrieval stacks. The last row of Table 3 presents these correlations. Note that although the topics are still paired between retrieval and generation evaluation, the objectives now differ: the labels used in retrieval evaluation are the relevance of each document.³ This correlation therefore essentially asks: does a retrieval model that retrieves more documents relevant to the topic provide useful information for downstream generation pipelines to improve nugget coverage? On NeuCLIR24, where information needs are complex and rich report requests, higher relevance does not indicate higher nugget coverage (low correlations). On RAG24, where information needs are short queries, relevance-based nDCG is still a reasonable indicator of the final nugget coverage, likely because such information needs are more likely to be satisfied by a single, highly relevant document, so high coverage is not always needed from the retrieval stack.

Overall, to directly answer RQ1: For a given RAG pipeline, does an input ranked list that covers more information lead to a more effective generated response for a given topic? Yes. Given the strong correlation between nugget-based metrics on the retrieval ranked lists and the nugget coverage of the downstream generated responses, we conclude that input ranked lists with higher information coverage tend to lead to generated responses with higher nugget coverage.
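The coverage objective behind these nugget-based metrics can be made concrete with a sketch of α-nDCG's novelty discount, following the standard Clarke et al. [13] formulation (an illustration, not the evaluation code used in the paper): each repeated occurrence of a nugget earns only a (1 − α) fraction of its previous gain, so redundant documents contribute diminishing value, while plain nDCG would reward them fully.

```python
import math

def alpha_dcg(ranking, doc_nuggets, alpha=0.5, cutoff=20):
    """ranking: ordered doc ids; doc_nuggets: doc id -> set of nugget ids."""
    seen = {}  # nugget id -> how many earlier docs already contained it
    score = 0.0
    for rank, doc in enumerate(ranking[:cutoff], start=1):
        # Each nugget's gain decays by (1 - alpha) per prior occurrence.
        gain = sum((1 - alpha) ** seen.get(n, 0) for n in doc_nuggets.get(doc, ()))
        for n in doc_nuggets.get(doc, ()):
            seen[n] = seen.get(n, 0) + 1
        score += gain / math.log2(rank + 1)
    return score

def alpha_ndcg(ranking, doc_nuggets, alpha=0.5, cutoff=20):
    # Normalize by a greedily built ideal ranking (the exact ideal is NP-hard).
    remaining, seen, ideal = set(doc_nuggets), {}, []
    while remaining and len(ideal) < cutoff:
        best = max(remaining,
                   key=lambda d: sum((1 - alpha) ** seen.get(n, 0)
                                     for n in doc_nuggets[d]))
        for n in doc_nuggets[best]:
            seen[n] = seen.get(n, 0) + 1
        ideal.append(best)
        remaining.remove(best)
    denom = alpha_dcg(ideal, doc_nuggets, alpha, cutoff)
    return alpha_dcg(ranking, doc_nuggets, alpha, cutoff) / denom if denom else 0.0
```

With `doc_nuggets = {"a": {"n1", "n2"}, "b": {"n1"}}`, ranking `["a", "b"]` scores 1.0, while `["b", "a"]` scores below 1.0: putting the narrower document first wastes the discount-free top rank on information the broader document also carries.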
³ NeuCLIR24 explicitly tasked the annotators to assess how useful a particular document is for writing a report without considering each piece of information in the document, while RAG24 simply asked the annotators to assess relevancy.

5.2 System-Level Analysis

While topic-level analysis is useful, it is not practical when designing a RAG pipeline: practitioners often select components based on external benchmark evaluations on queries different from those issued to the actual application. Table 4 presents the correlations at the system level, where each metric is first averaged over the topics in the respective benchmark before calculating the correlation.

For the two in-task sets of correlations (yellow blocks), nugget-based retrieval metrics provide even stronger correlations, indicating that a retrieval system that is generally effective at retrieving wide information coverage for a specific task would be a good candidate as the upstream retrieval model in a RAG pipeline. Compared to topic-level correlations, the relationship shown in Table 4 treats the retrieval metric as a property of the retrieval stack, untied from any specific topic. We observe a similar relationship for the relevance-based retrieval metrics, although relevance-based nDCG@20 on RAG24 shows a slightly stronger correlation with nugget coverage than α-nDCG.

System-level analysis also allows us to investigate this relationship across different collections and even different tasks (uncolored blocks in Table 4). When evaluating retrieval on the RAG24 RAG task and measuring nugget coverage on NeuCLIR24 (lower-left block), correlations are lower but still strong compared to evaluating on the same collection. Moving even further away, to the RAG24 Retrieval task (the last row in Table 4), the correlation becomes lower except for GPT-R with one query. This exception is likely due to the nature of GPT-R when using only one query.
It ingests the top 50 documents from the single retrieval result, so nDCG at that point essentially evaluates how well the model recalls all the relevant documents, resulting in a correlation very similar to that of the nugget-based nDCG. We observe very similar trends on the RAG24 RAG task (the right portion of Table 4).

These results indicate that, for the two conditions we experimented with (retrieval objectives and evaluation benchmark), matching both between retrieval and RAG evaluation leads to the strongest correlation. Relaxing one gives a similar level of correlation, with the detailed values varying across RAG pipelines. When both are mismatched, retrieval evaluation becomes less useful as an early indicator of the final RAG nugget coverage, especially for iterative pipelines such as LangGraph, which shows essentially no correlation (r = 0.0941). Since the RAG queries in RAG24 are less complex, evaluating retrieval effectiveness from any aspect can still provide some level of indication.

Therefore, to answer RQ2: For a given RAG pipeline, does using a more effective retrieval system as a component lead to a more effective RAG system? Yes, particularly when measuring retrieval systems with a metric that matches the objectives of the final RAG system. Evaluating on different benchmarks or with different objectives is still informative, though with lower correlations.

Looking across RAG pipelines, similar to the topic-level analysis, GPT-R with one query shows a strong correlation with the nugget-based retrieval metrics, while using three queries (two generated based on the input queries) starts to deviate. Furthermore, Bullet List generates 10 queries from the original user input to cover various aspects the user may want to see, and exhibits even lower correlation across all retrieval metrics compared to GPT-R on NeuCLIR24.
In RAG24, again, since the queries are shorter and factual, Bullet List essentially performs query expansion with the parametric knowledge of the underlying LLM. The net effect becomes query expansion rather than improved information coverage, since there are only a few aspects to cover. LangGraph has an even more complex interaction with the retrieval model: it iteratively ingests documents and generates queries to fill in missing information, leading to lower correlations with the retrieval metrics. While LangGraph may not be the most effective RAG pipeline among those we included (Table 2), it is the most detached from retrieval effectiveness. It achieves a nugget coverage of 0.559 with the Qwen3-8B Reranker reranking BM25, among the highest across its 15 variants using different retrieval stacks, but only 0.532 when using Qwen3-8B Embed followed by the Qwen3-8B Reranker. Yet BM25 followed by the Qwen3-8B Reranker achieves only 0.583 α-nDCG@20, much lower than Qwen3-8B Embed followed by the Qwen3-8B Reranker (0.691).

This detachment leads to a different design pattern: the LLM needs to adapt to the retrieval model and issue queries that the retrieval model can understand. For example, with a lexical retrieval model such as BM25, the LLM needs to produce lexical queries that exactly match the pieces of information the user needs; with retrieval models capable of understanding semantic queries, such as Qwen3 Embed, the LLM can form conceptual queries to obtain fuzzy matches. We observe this behavior in our LangGraph self-reflection process. The focus of development for improving the final generation output then moves away from improving the underlying retrieval model and toward the adaptivity of the LLM to the specific retrieval model. This may be preferable for some applications or production environments where the collection is large or only a limited set of retrieval systems is available. Based on our experiments, a simpler (or less iterative) RAG pipeline benefits more directly from improvements to the retrieval model, which may be preferable for most applications, since finetuning an LLM for a specific task or use case is more challenging and computationally expensive than adopting a more effective retrieval model, which is usually more economical.

So to answer RQ3: Can a more complex RAG pipeline compensate for a less effective retrieval system? Potentially, but it is not guaranteed. A more complex RAG pipeline can detach the final response quality from the effectiveness of the underlying retrieval model, as shown in our correlation analysis. However, detachment does not always lead to improvements in the final generation quality, as shown in Table 2. The key performance bottleneck in this case shifts from retrieving useful documents to better interaction between the LLM and the retrieval model.

Table 3: Topic-level Pearson correlation coefficients between RAG Nugget Coverage and the respective retrieval metrics. Each value for NeuCLIR24 Pilot and RAG24 is calculated from 285 (19 topics × 15 retrieval stacks) and 880 (55 topics × 15 retrieval stacks) pairs of values, respectively. All metrics in this table use a rank cutoff of 20.

| Label | Metric | NeuCLIR24 GPT-R (1) | NeuCLIR24 GPT-R (3) | NeuCLIR24 Bullet List | NeuCLIR24 LangGraph | RAG24 GPT-R (1) | RAG24 GPT-R (3) | RAG24 Bullet List |
|---|---|---|---|---|---|---|---|---|
| Nugget | α-nDCG | 0.5586 | 0.3489 | 0.2645 | 0.3343 | 0.4419 | 0.3785 | 0.3153 |
| Nugget | nDCG | 0.4329 | 0.2714 | 0.2623 | 0.1629 | 0.3114 | 0.2564 | 0.1857 |
| Nugget | StRecall | 0.4946 | 0.2907 | 0.2694 | 0.2216 | 0.3805 | 0.3231 | 0.2844 |
| Relevance | nDCG | 0.1407 | -0.0131 | 0.0458 | -0.0239 | 0.3467 | 0.3090 | 0.2881 |

Table 4: System-level Pearson correlation between RAG Nugget Coverage and the respective retrieval metrics and tasks. Cells with a yellow background indicate correlations on the same task (i.e., same objective) and same benchmark; cells with a purple background indicate correlations on the same benchmark but different tasks (i.e., different objectives). All correlations are calculated across RAG pipelines using 15 different retrieval stacks. All metrics in this table use a rank cutoff of 20.

| Retrieval Task | Metric | NeuCLIR24 GPT-R (1) | NeuCLIR24 GPT-R (3) | NeuCLIR24 Bullet List | NeuCLIR24 LangGraph | RAG24 GPT-R (1) | RAG24 GPT-R (3) | RAG24 Bullet List |
|---|---|---|---|---|---|---|---|---|
| NeuCLIR24 Pilot Report Generation | α-nDCG | 0.8105 | 0.4894 | 0.4625 | 0.2893 | 0.3543 | 0.2652 | 0.7580 |
| NeuCLIR24 Pilot Report Generation | nDCG | 0.8810 | 0.5900 | 0.3346 | 0.1360 | 0.5918 | 0.4855 | 0.9391 |
| NeuCLIR24 Pilot Report Generation | StRecall | 0.8251 | 0.5886 | 0.4642 | 0.1608 | 0.5272 | 0.4502 | 0.8459 |
| NeuCLIR24 MLIR | nDCG | 0.8518 | 0.4960 | 0.1721 | 0.2349 | 0.4061 | 0.3037 | 0.8160 |
| RAG24 RAG Task | α-nDCG | 0.7792 | 0.5048 | 0.2963 | 0.2557 | 0.6859 | 0.4027 | 0.8745 |
| RAG24 RAG Task | nDCG | 0.7953 | 0.5629 | 0.1572 | 0.0726 | 0.6053 | 0.4116 | 0.8391 |
| RAG24 RAG Task | StRecall | 0.7890 | 0.5161 | 0.2709 | 0.1941 | 0.7923 | 0.5060 | 0.8933 |
| RAG24 Retrieval | nDCG | 0.9028 | 0.4915 | 0.2390 | 0.0941 | 0.6938 | 0.5149 | 0.9130 |

Table 5: System-level Pearson correlation between InfoR from MiRAGE and retrieval metrics on NeuCLIR24. The top row additionally reports the macro-averaged InfoR values across all 15 retrieval stacks for each RAG pipeline. Retrieval metrics reported in this table use a rank cutoff of 20.

| Retrieval Task | Metric | GPT-R (1) | GPT-R (3) | Bullet List | LangGraph |
|---|---|---|---|---|---|
| – | Avg. InfoR | 0.7279 | 0.7387 | 0.6952 | 0.7325 |
| NeuCLIR24 Pilot | α-nDCG | 0.6777 | 0.3055 | 0.4623 | 0.2050 |
| NeuCLIR24 Pilot | nDCG | 0.6581 | 0.2546 | 0.6290 | 0.1069 |
| NeuCLIR24 Pilot | StRecall | 0.5874 | 0.3626 | 0.6050 | 0.2291 |
| NeuCLIR24 Pilot | (R) nDCG | 0.6869 | 0.2249 | 0.4227 | 0.1434 |
| RAG24 | α-nDCG | 0.6626 | 0.1243 | 0.6121 | -0.0301 |
| RAG24 | nDCG | 0.7450 | 0.1829 | 0.6047 | -0.2034 |
| RAG24 | StRecall | 0.5420 | -0.0160 | 0.6484 | -0.0930 |
| RAG24 | (R) nDCG | 0.7100 | 0.1395 | 0.5226 | 0.0364 |

5.3 Using a Different RAG Evaluator

To further understand the robustness of this relationship, we employ MiRAGE as an alternative RAG evaluator for the generated responses on the NeuCLIR24 task.
For brevity, we only present the system-level Pearson correlations between Information Recall (InfoR) from MiRAGE and the retrieval metrics in Table 5. We omit the detailed InfoR scores and report only the InfoR scores averaged over the 15 retrieval stacks for each RAG pipeline. The overall trend is similar, except that MiRAGE prefers LangGraph slightly more than Auto-ARGUE does. Despite slightly different preferences and slightly different metric definitions, correlations between InfoR and the nugget-based retrieval metrics are still very strong. When comparing with relevance-based nDCG on NeuCLIR, the correlations remain high across RAG pipelines.

Table 6: Multimodal retrieval and report generation evaluation results on WikiVideo using MiRAGE. All retrieval metrics use a rank cutoff of 10, since there are at most 10 relevant videos for each topic.

| First Stage | Reranker | MultiVENT 2.0 Recall | MultiVENT 2.0 nDCG | WikiVideo α-nDCG | WikiVideo nDCG | WikiVideo StRecall | WikiVideo (R) nDCG | InfoP | InfoR |
|---|---|---|---|---|---|---|---|---|---|
| CLIP | – | 0.333 | 0.306 | 0.538 | 0.498 | 0.724 | 0.498 | 81.5 | 89.0 |
| CLIP | ReasonRank | 0.477 | 0.478 | 0.628 | 0.629 | 0.826 | 0.629 | 91.9 | 87.6 |
| Language Bind | – | 0.355 | 0.326 | 0.476 | 0.457 | 0.634 | 0.457 | 84.0 | 89.2 |
| Language Bind | ReasonRank | 0.498 | 0.487 | 0.566 | 0.563 | 0.754 | 0.563 | 94.5 | 86.6 |
| Video-ColBERT | – | 0.341 | 0.422 | 0.431 | 0.383 | 0.634 | 0.383 | 83.8 | 89.9 |
| Video-ColBERT | ReasonRank | 0.448 | 0.535 | 0.553 | 0.522 | 0.734 | 0.522 | 87.1 | 87.6 |
| OmniEmbed | – | 0.523 | 0.495 | 0.530 | 0.454 | 0.721 | 0.454 | 88.0 | 90.9 |
| OmniEmbed | ReasonRank | 0.590 | 0.566 | 0.584 | 0.587 | 0.778 | 0.587 | 91.2 | 88.6 |
| MMMORRF | – | 0.611 | 0.585 | 0.540 | 0.503 | 0.724 | 0.503 | 94.4 | 88.0 |
| MMMORRF | ReasonRank | 0.634 | 0.639 | 0.605 | 0.617 | 0.785 | 0.617 | 91.8 | 88.2 |

Table 7: System-level Pearson correlation between retrieval metrics and Info metrics on WikiVideo. Retrieval metrics reported in this table use a rank cutoff of 10.

| Retrieval Task | Metric | InfoP | InfoR |
|---|---|---|---|
| WikiVideo Retrieval | α-nDCG | 0.6476 | -0.5821 |
| WikiVideo Retrieval | nDCG | 0.6528 | -0.6825 |
| WikiVideo Retrieval | StRecall | 0.6530 | -0.5405 |
| WikiVideo Retrieval | (R) nDCG | 0.6528 | -0.6825 |
| MultiVENT 2.0 | Recall | 0.8447 | -0.2837 |
| MultiVENT 2.0 | nDCG | 0.7799 | -0.3269 |
Interestingly, when evaluating retrieval with the RAG24 tasks, we still observe a strong correlation for GPT-R with one query, but the correlations for LangGraph drop from slightly positive or zero with Auto-ARGUE to zero or slightly negative with MiRAGE. This indicates that under a more lenient RAG information coverage metric (i.e., one that does not discard information when a sentence is not supported by its citation), LangGraph is even more detached from the retrieval model, while the other systems still exhibit positive correlations with the retrieval metrics.

Therefore, to answer RQ4: Do these relationships hold across different RAG evaluators? Yes, the relationship between retrieval and RAG information coverage evaluation still holds when using a different evaluator. However, nuanced differences between evaluators and their metrics may lead to slightly different results, since they measure the systems from different aspects and by slightly different criteria.

5.4 Multimodal RAG

To further validate our findings, Table 6 summarizes the effectiveness of the multimodal retrieval and generation systems on the WikiVideo video retrieval and article generation task [37]. The 10 retrieval combinations offer a variety of retrieval quality and variance under both human- and claim-based relevance evaluation. Because of the computational cost of multimodal RAG systems, we study only one representative video RAG system using all 10 retrieval stacks. In Table 6, the 10 retrieval stacks exhibit a wide range of retrieval effectiveness on both an external collection (MultiVENT 2.0) and the same collection (WikiVideo) as the generation task. In Table 7, we perform a system-level analysis of the Pearson correlation coefficient between MiRAGE's Information Precision (InfoP, measuring factuality) and Information Recall (InfoR, measuring coverage) and retrieval effectiveness.
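The system-level correlations reported here and in Tables 4 and 5 all follow the same recipe: macro-average each metric over a benchmark's topics for every retrieval stack, then correlate the per-stack means. A minimal sketch, with illustrative data shapes (one list of per-topic scores per retrieval stack):

```python
import math

def macro_average(per_topic_scores):
    """Mean of one system's per-topic scores on a benchmark."""
    return sum(per_topic_scores) / len(per_topic_scores)

def system_level_r(retrieval_by_stack, coverage_by_stack):
    """Each argument: list (one entry per retrieval stack) of per-topic scores.
    Averaging first makes each metric a property of the stack, untied from
    any specific topic; Pearson is then taken across the stack means."""
    xs = [macro_average(s) for s in retrieval_by_stack]
    ys = [macro_average(s) for s in coverage_by_stack]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Because topic-level noise is averaged out before correlating, system-level correlations can be stronger than topic-level ones even when the per-topic relationship is weak, which matches the pattern across Tables 3 and 4.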
Overall, we observe a strong correlation between the factuality of a generation system and retrieval performance, rather than between information coverage and retrieval performance. As shown in Martin et al. [37], multimodal language models are often overfitted to parametric knowledge, resulting in a strong tendency to leverage parametric knowledge when responding to user queries instead of investigating the retrieved materials, similar to the effect of LangGraph in the text RAG analysis. Since topics in WikiVideo are prominent events from 2015 to 2023, information on these events is almost certainly part of the pretraining data. The purpose of retrieval, in this case, becomes verifying the parametric knowledge to improve the factuality of the generation rather than its coverage. This intuition aligns with our observation in Table 7. Nevertheless, retrieval effectiveness is still a strong indicator of the quality of the generated responses.

Overall, to directly answer RQ5: Do these relationships hold in multimodal RAG? Yes, given the strong positive correlation between factuality and retrieval effectiveness. We expect that on multimodal RAG benchmarks that actually require gathering information from the collection (which, to our knowledge, do not yet exist), information coverage would again correlate with the retrieval information coverage.

6 Conclusion

We investigate the relationships between upstream retrieval effectiveness and downstream RAG information coverage across multiple retrieval systems, RAG pipelines, and evaluation frameworks. We show that nugget-based retrieval metrics correlate strongly with nugget coverage of the downstream generated response at both the topic and system levels, indicating that employing more effective retrieval systems leads to better generated responses, particularly when the retrieval and generation objectives are aligned. More complex or iterative RAG pipelines can decouple this relationship.
These results enable more efficient RAG system development by establishing retrieval metrics as reliable proxies for generation quality, reducing the need for costly end-to-end evaluation.

Acknowledgments

Certain products and software are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement of any product or service by NIST, nor is it intended to imply that the software identified is necessarily the best available for the purpose.

References

[1] Zahra Abbasiantaeb, Simon Lupart, Leif Azzopardi, Jeffery Dalton, and Mohammad Aliannejadi. 2025. Conversational Gold: Evaluating Personalized Conversational Search System using Gold Nuggets. arXiv:2503.09902 [cs.IR] https://arxiv.org/abs/2503.09902
[2] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511 [cs.CL] https://arxiv.org/abs/2310.11511
[3] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. In The Twelfth International Conference on Learning Representations.
[4] Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, et al. 2024. Non-determinism of "deterministic" LLM settings. arXiv preprint arXiv:2408.04667 (2024).
[5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Report. arXiv:2502.13923 [cs.CV] https://arxiv.org/abs/2502.13923
[6] Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information Fusion in the Context of Multi-Document Summarization. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, USA, 550–557. https://doi.org/10.3115/1034678.1034760
[7] Andrew Blair-Stanek and Benjamin Van Durme. 2025. LLMs Provide Unstable Answers to Legal Questions. In Proceedings of the Twentieth International Conference on Artificial Intelligence and Law. 425–429.
[8] Jaime G. Carbonell and Jade Goldstein. 2018. The Use of MMR and Diversity-Based Reranking in Document Reranking and Summarization. (June 2018). https://doi.org/10.1184/R1/6610814.v1
[9] Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation. arXiv:2404.00610 [cs.CL] https://arxiv.org/abs/2404.00610
[10] Olivier Chapelle, Shihao Ji, Ciya Liao, Emre Velipasaoglu, Larry Lai, and Su-Lin Wu. 2011. Intent-based diversification of web search results: metrics and algorithms. Inf. Retr. 14, 6 (Dec. 2011), 572–592.
[11] Xin Cheng, Xun Wang, Xingxing Zhang, Tao Ge, Si-Qing Chen, Furu Wei, Huishuai Zhang, and Dongyan Zhao. 2024. xRAG: Extreme Context Compression for Retrieval-Augmented Generation with One Token. In Advances in Neural Information Processing Systems (NeurIPS).
[12] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Azin Ashkan. 2011. A comparative analysis of cascade measures for novelty and diversity. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (Hong Kong, China) (WSDM '11). Association for Computing Machinery, New York, NY, USA, 75–84. https://doi.org/10.1145/1935826.1935847
[13] Charles L. A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Singapore) (SIGIR '08). Association for Computing Machinery, New York, NY, USA, 659–666. https://doi.org/10.1145/1390334.1390446
[14] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Boston, MA, USA) (SIGIR '09). Association for Computing Machinery, New York, NY, USA, 758–759. https://doi.org/10.1145/1571941.1572114
[15] Laura Dietz, Bryan Li, Gabrielle Liu, Jia-Huei Ju, Eugene Yang, Dawn Lawrie, William Walden, and James Mayfield. 2026. Incorporating Q&A Nuggets into Retrieval-Augmented Generation. In Proceedings of the 48th European Conference on Information Retrieval (ECIR 2026).
[16] Kevin Duh, Dawn Lawrie, Debashish Chakraborty, Roxana Petcu, Eugene Yang, Kenton Murray, Daniel Khashabi, and Maxime Dassen. 2025. HLTCOE Generation Team at TREC 2025. In The Thirty-Fourth Text REtrieval Conference Proceedings (TREC 2025). https://trec-ragtime.github.io/assets/notebooks/2025/hltcoe-gen.pdf
[17] Kevin Duh, Eugene Yang, Orion Weller, Andrew Yates, and Dawn Lawrie. 2025. HLTCOE at LiveRAG: GPT-Researcher using ColBERT retrieval. arXiv preprint arXiv:2506.22356 (2025).
[18] Assaf Elovic. 2023. gpt-researcher. https://github.com/assafelovic/gpt-researcher
[19] Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. 2025. KiRAG: Knowledge-Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation. arXiv preprint arXiv:2502.18397 (2025).
[20] Naghmeh Farzi and Laura Dietz. 2024. An Exam-based Evaluation Approach Beyond Traditional Relevance Judgments. arXiv:2402.00309 [cs.IR] https://arxiv.org/abs/2402.00309
[21] Naghmeh Farzi and Laura Dietz. 2024. Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems. In Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval (Washington, DC, USA) (ICTIR '24). Association for Computing Machinery, New York, NY, USA, 175–184. https://doi.org/10.1145/3664190.3672511
[22] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL] https://arxiv.org/abs/2312.10997
[23] Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, and Taro Watanabe. 2025. IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation. In Proceedings of the Second Conference on Language Modeling (COLM '25).
[24] Gijs Hendriksen, Djoerd Hiemstra, and Arjen P. de Vries. 2025. Selective Search as a First-Stage Retriever. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 16th International Conference of the CLEF Association, CLEF 2025, Madrid, Spain, September 9–12, 2025, Proceedings. Springer-Verlag, Berlin, Heidelberg, 17–33. https://doi.org/10.1007/978-3-032-04354-2_2
[25] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (Oct. 2002), 422–446. https://doi.org/10.1145/582415.582418
[26] Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. arXiv:2509.04664 [cs.CL] https://arxiv.org/abs/2509.04664
[27] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714 [cs.CL] https://arxiv.org/abs/2310.03714
[28] Gun Il Kim, Jong Wook Kim, and Beakcheol Jang. 2025. UniRAG: A Unified RAG Framework for Knowledge-Intensive Queries with Decomposition, Break-Down Reasoning, and Iterative Rewriting. In Findings of the Association for Computational Linguistics: EMNLP 2025. 18795–18810.
[29] Reno Kriz, Kate Sanders, David Etter, Kenton Murray, Cameron Carpenter, Kelly Van Ochten, Hannah Recknor, Jimena Guallar-Blasco, Alexander Martin, Ronald Colaianni, Nolan King, Eugene Yang, and Benjamin Van Durme. 2025. MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval. arXiv:2410.11619 [cs.CV] https://arxiv.org/abs/2410.11619
[30] Weronika Lajewska and Krisztian Balog. 2025. GINGER: Grounded Information Nugget-Based Generation of Responses. In Proceedings of the 48th International ACM SIGIR Conference (SIGIR '25). https://krisztianbalog.com/files/sigir2025-ginger.pdf
[31] Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE. arXiv preprint arXiv:2403.06789 (2024).
[32] Victor Lavrenko and W. Bruce Croft. 2017. Relevance-based language models. In ACM SIGIR Forum, Vol. 51. ACM, New York, NY, USA, 260–267.
[33] Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, and Eugene Yang. 2025. Overview of the TREC 2024 NeuCLIR Track. arXiv:2509.14355 [cs.IR] https://arxiv.org/abs/2509.14355
[34] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 [cs.CL] https://arxiv.org/abs/2307.03172
[35] Xueguang Ma, Luyu Gao, Shengyao Zhuang, Jiaqi Samantha Zhan, Jamie Callan, and Jimmy Lin. 2025. Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality. arXiv:2505.02466 [cs.IR] https://arxiv.org/abs/2505.02466
[36] Sean MacAvaney, Craig Macdonald, and Iadh Ounis. 2021. Streamlining Evaluation with ir-measures. arXiv:2111.13466 [cs.IR] https://arxiv.org/abs/2111.13466
[37] Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, and Benjamin Van Durme. 2025. WikiVideo: Article Generation from Multiple Videos. arXiv:2504.00939 [cs.CV] https://arxiv.org/abs/2504.00939
[38] Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, and Benjamin Van Durme. 2025. Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation. arXiv:2510.24870 [cs.CL] https://arxiv.org/abs/2510.24870
[39] James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, and Noah Hibbler. 2024. On the Evaluation of Machine-Generated Reports. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024). ACM, 1904–1915. https://doi.org/10.1145/3626772.3657846
[40] Teague McMillan, Gabriele Dominici, Martin Gjoreski, and Marc Langheinrich. 2025. Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models? arXiv:2510.24236 [cs.CL] https://arxiv.org/abs/2510.24236
[41] Federico Nanni, Bhaskar Mitra, Matt Magnusson, and Laura Dietz. 2017. Benchmark for Complex Answer Retrieval. arXiv:1705.04803 [cs.IR] https://arxiv.org/abs/1705.04803
[42] Federico Nanni, Bhaskar Mitra, Matt Magnusson, and Laura Dietz. 2017. Benchmark for Complex Answer Retrieval. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval (Amsterdam, The Netherlands) (ICTIR '17). Association for Computing Machinery, New York, NY, USA, 293–296. https://doi.org/10.1145/3121050.3121099
[43] Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, and Andrew Yates. 2025. Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector. arXiv:2510.00671 [cs.IR] https://arxiv.org/abs/2510.00671
[44] Paul Over. 2001. The TREC interactive track: an annotated bibliography. Inf. Process. Manage. 37, 3 (May 2001), 369–381. https://doi.org/10.1016/S0306-4573(00)00053-4
[45] Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track. arXiv:2406.16828 [cs.IR] https://arxiv.org/abs/2406.16828
[46] Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. 2024. Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework. arXiv:2411.09607 [cs.IR] https://arxiv.org/abs/2411.09607
[47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV] https://arxiv.org/abs/2103.00020
[48] Arun Reddy, Alexander Martin, Eugene Yang, Andrew Yates, Kate Sanders, Kenton Murray, Reno Kriz, Celso M. de Melo, Benjamin Van Durme, and Rama Chellappa. 2025. Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval. arXiv:2503.19009 [cs.CV] https://arxiv.org/abs/2503.19009
[49] Stephen Robertson. 2008. A new interpretation of average precision. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 689–690.
[50] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
[51] Saron Samuel, Dan DeGenaro, Jimena Guallar-Blasco, Kate Sanders, Oluwaseun Eisape, Tanner Spendlove, Arun Reddy, Alexander Martin, Andrew Yates, Eugene Yang, Cameron Carpenter, David Etter, Efsun Kayi, Matthew Wiesner, Kenton Murray, and Reno Kriz. 2025. MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion. arXiv:2503.20698 [cs.CV] https://arxiv.org/abs/2503.20698
[52] Tyler Skow, Alexander Martin, Benjamin Van Durme, Rama Chellappa, and Reno Kriz. 2026. RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval. arXiv:2602.02444 [cs.IR] https://arxiv.org/abs/2602.02444
[53] Ian Soboroff, Donna Harman, et al. 2003. Overview of the TREC 2003 Novelty Track. In TREC. 38–53.
[54] Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. 2025. Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges. arXiv:2504.15205 [cs.CL] https://arxiv.org/abs/2504.15205
[55] Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, and Jimmy Lin. 2025. Assessing Support for the TREC 2024 RAG Track: A Large-Scale Comparative Study of LLM and Human Evaluations. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Padua, Italy) (SIGIR '25). Association for Computing Machinery, New York, NY, USA, 2759–2763. https://doi.org/10.1145/3726302.3730165
[56] William Walden, Orion Weller, Laura Dietz, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, and Eugene Yang. 2025. Auto-ARGUE: LLM-Based Report Generation Evaluation. arXiv preprint arXiv:2509.26184 (2025).
[57] Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, and Benjamin Van Durme. 2025. Rank1: Test-Time Compute for Reranking in Information Retrieval.
arXiv:2502.18418 [cs.IR] https://arxiv.org/abs/2502.18418 [58]Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. Qwen2.5-Omni Technical Report. arXiv:2503.20215 [cs.CL] https://arxiv.org/abs/2503.20215 [59]An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388 [60]Diji Yang, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Jie Yang, and Yi Zhang. 2024. Im-rag: Multi-round retrieval-augmented generation through learning inner monologues. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 730–740. [61]Diji Yang, Linda Zeng, Jinmeng Rao, and Yi Zhang. 2025. Knowing You Don’t Know: Learning When to Continue Search in Multi-round RAG through Self- Practicing. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1305–1315. [62]Eugene Yang, Dawn Lawrie, James Mayfield, Douglas W. Oard, and Scott Miller. 2024. Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation. In Proceedings of the 46th European Conference on Information Retrieval (ECIR). 
https://arxiv.org/abs/2401.04810 [63] Eugene Yang, Dawn Lawrie, Orion Weller, and James Mayfield. 2025. HLTCOE at TREC 2024 NeuCLIR Track. arXiv:2510.00143 [cs.CL] https://arxiv.org/abs/ 2510.00143 [64]Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, and Trevor Adri- aanse. 2026. RoutIR: Fast Serving of Retrieval Pipelines for Retrieval-Augmented Generation. arXiv:2601.10644 [cs.IR] https://arxiv.org/abs/2601.10644 [65]Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176 [cs.CL] https://arxiv.org/abs/2506.05176 [66]Mingjun Zhao, Shengli Yan, Bang Liu, Xinwang Zhong, Qian Hao, Haolan Chen, Di Niu, Bowei Long, and Weidong Guo. 2021. QBSUM: A large-scale query-based document summarization dataset from real-world applications. Computer Speech & Language 66 (March 2021), 101166. https://doi.org/10.1016/j.csl.2020.101166 [67]Wei Zheng, Xuanhui Wang, Hui Fang, and Hong Cheng. 2012. Coverage-based search result diversification. Inf. Retr. 15, 5 (Oct. 2012), 433–457. https://doi.org/ 10.1007/s10791-011-9178-4 [68]Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, and Li Yuan. 2024. LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv:2310.01852 [cs.CV] https://arxiv.org/abs/2310.01852 11