Paper deep dive
Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers
Saron Samuel, Benjamin Van Durme, Eugene Yang
Abstract
While reasoning rerankers, such as Rank1, have demonstrated strong abilities in improving ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings demonstrate that reasoning neither improves nor harms fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF), remained stable (0.33-0.35) across all models, even as relevance varied substantially (nDCG 0.247-1.000). Demographic breakdown analysis revealed fairness gaps for geographic attributes regardless of model architecture. These results indicate that future work in specializing reasoning models to be aware of fairness attributes could lead to improvements, as current implementations preserve the fairness characteristics of their input ranking.
Tags
Links
- Source: https://arxiv.org/abs/2603.10332v1
- Canonical: https://arxiv.org/abs/2603.10332v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 98%
Last extracted: 3/13/2026, 1:09:27 AM
Summary
This paper presents the first systematic comparison of fairness between reasoning and non-reasoning rerankers in information retrieval. Using the TREC 2022 Fair Ranking Track dataset, the authors evaluate six models across various demographic attributes. The findings indicate that current reasoning rerankers neither improve nor harm fairness compared to non-reasoning approaches, with fairness metrics (AWRF) remaining stable across models. While query formulation significantly impacts relevance, it does not alter fairness, and demographic gaps—particularly in subject geography—persist regardless of model architecture.
Entities (6)
Relation Signals (3)
Reasoning Rerankers → comparedto → Non-Reasoning Rerankers
confidence 100% · We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers.
AWRF → measures → Fairness
confidence 100% · Our fairness metric, Attention-Weighted Rank Fairness (AWRF) remained stable
Rank1 → evaluatedon → TREC 2022 Fair Ranking Track
confidence 95% · Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models
Cypher Suggestions (2)
Identify relations between models and the dataset · confidence 95% · unvalidated
MATCH (m:Reranker)-[r:EVALUATED_ON]->(d:Dataset) RETURN m.name, d.name
Find all reranker models evaluated in the study · confidence 90% · unvalidated
MATCH (m:Reranker) RETURN m.name, m.type
Full Text
46,980 characters extracted from source content.
Does Reasoning Make Search More Fair? Comparing Fairness in Reasoning and Non-Reasoning Rerankers

Saron Samuel, Benjamin Van Durme, and Eugene Yang
Johns Hopkins University; Human Language Technology Center of Excellence
ssamue21@jhu.edu, vandurme@jhu.edu, eugene.yang@jhu.edu

Abstract. While reasoning rerankers, such as Rank1, have demonstrated strong abilities in improving ranking relevance, it is unclear how they perform on other retrieval qualities such as fairness. We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers. Using the TREC 2022 Fair Ranking Track dataset, we evaluate six reranking models across multiple retrieval settings and demographic attributes. Our findings demonstrate that reasoning neither improves nor harms fairness compared to non-reasoning approaches. Our fairness metric, Attention-Weighted Rank Fairness (AWRF), remained stable (0.33-0.35) across all models, even as relevance varied substantially (nDCG 0.247-1.000). Demographic breakdown analysis revealed fairness gaps for geographic attributes regardless of model architecture. These results indicate that future work in specializing reasoning models to be aware of fairness attributes could lead to improvements, as current implementations preserve the fairness characteristics of their input ranking.

Keywords: reasoning reranker · LLM reranking · retrieval fairness · group fairness

1 Introduction

Search systems play a critical role in shaping how we access information since their rankings determine which perspectives gain visibility.
While surfacing relevant information usually yields better utility in tasks [7], ensuring a comprehensive coverage of different opinions and sources is critical when information systems are used for decision making [20]. Recently, as reasoning large language models (LLMs) have shown strong effectiveness in various tasks through test-time reasoning [2], rerankers based on these reasoning models can reason about a ranking, generating thoughts and justifications before producing a final ranking. These reasoning rerankers, such as Rank1 [28], Qwen3-Reranker [31], and ReasonRank [13], have demonstrated strong improvements in relevance metrics across several benchmarks, such as BEIR [26], NeuCLIR [10], and BRIGHT [25]. While reasoning rerankers may produce more relevant rankings, how they impact fairness in a ranking remains underexplored.

arXiv:2603.10332v1 [cs.IR] 11 Mar 2026

Table 1. Qualitative comparison of subject geography between a BM25 initial retrieval, then reranked by Rank1, and the ground truth Oracle rankings for the query “basic overview of sailing and types of sailboats” using the Fair Ranking 2022 document collection. These were marked by hand annotation (country level), and closely followed the original annotations (continent level). Entries with – mean that the subject geography could not be identified.

| BM25 Title | Origin | Rank1 Title | Origin | Oracle Title | Origin |
|---|---|---|---|---|---|
| Outline of sailing | – | Boating | – | Seafarer 37 | USA |
| Ultimate 20 | USA | Outline of sailing | – | Stefan Krook | Sweden |
| Moore 30 | USA | Sailboat | – | Jali Mäkilä | Finland |
| Catalina 275 Sport | USA | Puffer (dinghy) | USA | Yoav Omer | Israel |
| Ranger 22 | USA | LaserPerformance | UK/USA | Humber Keel | UK |
| Catalina 16.5 | USA | Sandeq | Indonesia | Helsen 22 | USA |
| CS 44 | Canada | MC Scow | USA | Brandworkers Int’l | USA |
| Nonsuch 36 | Canada | Kaep | Palau | David Barnes | New Zealand |
| Hunter 45 DS | USA | Pearson 26 | USA | Joanna Burzyńska | Poland |
| Hunter 33.5 | USA | Inclusion Catamaran | Portugal | Newport 214 | USA |

Fairness in ranking can be operationalized as an equitable distribution of exposure across groups defined by sensitive attributes (such as gender, age, occupation, geography) while still maintaining retrieval effectiveness [5]. For example, a search system might overexpose content about men when women are equally relevant.

On one hand, reasoning could help by encouraging the model to deliberate more deeply about context. On the other hand, the reasoning process itself is shaped by pretraining data and internal biases [11]. If those biases are reflected in the generated justifications, reasoning may instead amplify unfairness by producing confidently skewed rationales.

Consider a search query “basic overview of sailing and types of sailboats.” Searching over the Fair Ranking 2022 track document collection, Table 1 shows that all three rankings consist of relevant documents. BM25 [21] retrieval surfaces mainly USA-based articles (7 of 10), while an oracle ranking constructed from relevance annotations achieves better geographic diversity with articles from Sweden, Finland, Israel, the UK, Poland, and New Zealand. Rank1 [28], a pointwise reasoning reranker that reranks on top of BM25, also outputs a diverse ranked list with different document origins. To investigate this, we conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers.
We want to answer the following research questions:

RQ1 How does query formulation impact fairness and relevance in both reasoning and non-reasoning rerankers?
RQ2 Do reasoning rerankers exhibit different fairness characteristics compared to non-reasoning rerankers?
RQ3 How does fairness vary across different demographic attributes for reasoning vs. non-reasoning rerankers?

This work examines exposure-based group fairness, where documents from different demographic groups are expected to receive visibility proportional to their representation among relevant documents and to real-world population statistics. We assess fairness using Attention-Weighted Rank Fairness (AWRF), which quantifies how closely the position-weighted exposure distribution across protected groups matches a target fair distribution.

Using the TREC 2022 Fair Ranking Track dataset [5], we evaluate models across multiple sensitive attributes: age, alphabetical (topic-based), gender, languages, occupation, popularity, source geography, and subject geography. We experiment with two query settings: original keyword queries, and rewritten queries generated via gpt-4o-mini [17].

Our contributions are as follows:
– We conduct the first systematic comparison of fairness between reasoning and non-reasoning rerankers for information retrieval.
– We demonstrate that reasoning capabilities, as currently implemented, neither improve nor harm fairness compared to non-reasoning approaches.
– We show that query formulation substantially impacts both relevance and fairness, with natural language queries improving effectiveness for all models.
– We find demographic differences in fairness, with the subject geography attribute consistently showing 10-15% lower fairness than other attributes across all models and retrieval settings.

2 Background

2.1 Reasoning vs. Non-Reasoning Rerankers

Rerankers refine an initial list of candidate documents (from BM25 [21] or dense retrieval like Qwen3 [31]) by re-evaluating their relevance to a given query. Traditional rerankers learn implicit notions of relevance from labeled pairs, while recent approaches integrate explicit reasoning to simulate step-by-step judgment at query time. We distinguish between reasoning and non-reasoning rerankers. Non-reasoning rerankers predict a scalar relevance score or an ordered list of document ids without generating explicit reasoning chains [16, 18]. Reasoning rerankers, such as Rank1 [28], generate intermediate reasoning steps before assigning a relevance score. These models aim to emulate human-like thinking, trained with distillation from larger reasoning models such as DeepSeek R1 [28, 29] or with explicit preference optimization [32]. This difference makes reasoning rerankers a good test case for fairness.

2.2 Fairness in Ranking

Fairness in ranking has long been a subject of study, from resume ranking for jobs [6] to recommendation systems [1]. The challenge of balancing relevance with fairness has been approached from multiple perspectives, including individual fairness [4], group fairness [30], and exposure-based fairness [3, 24].

Early work on fair ranking focused on ensuring that qualified candidates receive appropriate representation in ranked lists. Zehlike et al. [30] introduced FA*IR, an algorithm that ensures minimum representation of protected groups. Singh and Joachims [24] proposed fairness of exposure metrics, arguing that items should receive exposure proportional to their relevance. Diaz et al. [3] extended this work by introducing expected exposure metrics for stochastic ranking policies. Wang et al. [27] conducted a study on fairness of LLMs as rankers, finding that LLMs can exhibit demographic biases in ranking decisions.
2.3 TREC Fair Ranking Track

The TREC Fair Ranking Track [5] introduced a benchmark for studying how retrieval systems can balance relevance and fairness in document exposure. We evaluate static ranked lists using normalized Discounted Cumulative Gain (nDCG) for relevance and Attention-Weighted Rank Fairness (AWRF) [19, 22] for fairness. AWRF measures how closely the exposure of documents across protected groups (gender, occupation, geography, etc.) matches a target distribution that combines real-world population data with the distribution among relevant articles. AWRF compares the actual exposure distribution with a target fair distribution using the Jensen-Shannon divergence, $d_{JS}$ [15]. AWRF is calculated as:

$$\mathrm{AWRF}(L) = 1 - d_{JS}(d_L, d_q) \quad (1)$$

where $d_L$ is the normalized, position-weighted exposure distribution across groups in ranking $L$, and $d_q$ is the target distribution based on group representation in relevant documents and global demographics for query $q$. AWRF ranges from 0 to 1, with 1 indicating perfect fairness. The final system score, $M_1$, combines both AWRF and nDCG as:

$$M_1(L) = \mathrm{AWRF}(L) \times \mathrm{nDCG}(L)$$

rewarding systems that maintain both metrics.

The query texts were made from keywords extracted from articles relevant to a WikiProject using KeyBERT [8]. For each WikiProject, keywords were aggregated from relevant articles and manually filtered to form the final query texts. The corpus consisted of articles from English Wikipedia. Relevance was obtained from existing WikiProject page lists: if a WikiProject had tagged an article as being within its scope, that article was considered relevant for queries representing that WikiProject. The dataset includes multiple fairness categories derived from Wikipedia and Wikidata metadata that are later used to measure AWRF.
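The AWRF and M1 formulas in Section 2.3 can be sketched in Python. This is a minimal illustration, assuming the exposure and target vectors are already-normalized probability distributions over groups; the track's exact position-weighting scheme for building the exposure vector is not reproduced here, and the function names are mine:

```python
import numpy as np

def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence d_JS between probability vectors p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability entries.
        mask = a > 0
        return np.sum(a[mask] * (np.log(a[mask] / b[mask]) / np.log(base)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def awrf(exposure, target):
    """AWRF(L) = 1 - d_JS(d_L, d_q); 1 indicates perfect fairness."""
    return 1.0 - js_divergence(exposure, target)

def m1(exposure, target, ndcg):
    """Official track score: M_1(L) = AWRF(L) * nDCG(L)."""
    return awrf(exposure, target) * ndcg
```

With base-2 logarithms, the divergence lies in [0, 1], so identical distributions give AWRF = 1 and disjoint ones give AWRF = 0.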
This includes geographic location, gender for biographical articles, occupation, article age, alphabetical position, article creation date, popularity based on pageviews, and cross-language replication. This track provides an important pre-reasoning baseline for understanding fairness trade-offs in retrieval systems and serves as the basis for our comparison.

With this foundation, we design a set of experiments to investigate how reasoning and non-reasoning rerankers compare.

3 Experiment Protocol

We design our experiments to address our research questions:

RQ1 Query Formulation. We evaluate all reranking models with either original keyword queries from Fair Ranking 2022 or rewritten queries. This is done with BM25 as an initial retriever.
RQ2 Reasoning vs. Non-Reasoning. We compare six rerankers. Three are reasoning models (Rank1, Qwen3-Reranker, ReasonRank) and three are non-reasoning (MonoT5, RankZephyr, RankLLaMA).
RQ3 Demographic Breakdown. We compute M1 separately for the eight demographic attributes (age, alphabetic, gender, languages, occupation, popularity, source geography, subject geography) across all retrieval settings.

3.1 Initial Retrieval Settings

Before applying rerankers, we established four initial retrieval settings:

– BM25 with keyword queries
– BM25 with rewritten queries
– Qwen3-Embedding-8B with rewritten queries
– RRF fusion combining BM25 (rewritten) and Qwen3 rankings

Queries are rewritten using gpt-4o-mini with the original query title plus a subset of four randomly selected keywords from the list, which are then used alongside a prompt. We rewrite queries because LLM-based rerankers are designed to process natural language rather than keyword lists. Query rewriting transforms keyword collections ("Sailing, ocean, sailboat, paddle") into a coherent query ("basic overview of sailing and types of sailboats"), better matching how users formulate real search queries.
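The RRF fusion setting listed above can be sketched as standard reciprocal rank fusion. The constant k = 60 is the conventional default and an assumption here, since the paper does not state the value it used:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank).

    `rankings` is a list of ranked lists (best first). Documents appearing
    near the top of several lists accumulate the highest fused scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Return doc ids sorted by descending fused score.
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing a BM25 list with a dense-retrieval list promotes documents that both retrievers rank highly, even when neither ranks them first.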
This allows us to evaluate whether query formulation impacts fairness independently of model architecture.

We also constructed an oracle ranking that approximates an upper bound on relevance by reordering documents to achieve an nDCG of 0.9 for the top 500 documents. This is based on the dataset’s gold judgments. This oracle setting serves as a reference for how much fairness remains even when retrieval relevance is nearly optimal. The oracle retrieval setting is produced via the script provided by the TREC 2022 Fair Ranking Track.³

³ https://github.com/fair-trec/trec2022-fair-public/blob/main/oracle-runs.py

3.2 Rerankers

For reranking, we select rerankers with roughly the same number of parameters to eliminate the effect of different model sizes.

Listwise Rerankers
– RankZephyr-7B-V1 [18] (referred to as RankZephyr-7B in tables) is a zero-shot listwise reranker built on a 7B Zephyr backbone and instruction fine-tuned by distilling teacher reorderings from RankGPT3.5 and a smaller set of RankGPT4 outputs.
– ReasonRank-7B [13] is a reasoning listwise reranker built on Qwen2.5-7B-Instruct. It is trained through a two-stage framework combining supervised fine-tuning on synthesized reasoning-intensive data and reinforcement learning with multi-view ranking rewards.

Pointwise Rerankers
– RankLLaMA-7B [14] is a pointwise reranker fine-tuned from the LLaMA-2-7B architecture for multi-stage text retrieval. The model takes query-document pairs as input and projects the final hidden state of the end-of-sequence token through a linear layer to produce scalar relevance scores. RankLLaMA is trained using contrastive loss with hard negatives sampled from RepLLaMA’s top ranking results.
– MonoT5-base-msmarco-10k [16] (referred to as MonoT5-0.3B in tables) is a pointwise reranker based on the T5-base encoder-decoder architecture. It is fine-tuned on the MS MARCO passage ranking dataset for 10,000 steps to predict document relevance as a text generation task.
It produces “true” or “false” tokens for each query-document pair and converts the logits of these generated tokens into relevance probabilities for ranking.
– Qwen3-Reranker-8B [31] is a cross-encoder reranker built on the Qwen3 language model architecture, utilizing 8 billion parameters to assess query-document relevance through joint encoding of query-document pairs.
– Rank1-7B [28] is a pointwise reasoning reranker distilled from DeepSeek-R1’s reasoning traces on MS MARCO, trained via supervised fine-tuning on 635,000 examples of query-passage relevance judgments with step-by-step reasoning chains. The model is built on Qwen 2.5 base models and uses test-time compute to generate explainable reasoning before producing binary relevance predictions.

We evaluated these reasoning and non-reasoning rerankers across our four initial retrieval settings and the oracle setting. We rerank the top 500 of each ranking. Note that the rewritten queries and reranker prompts are all neutral, without introducing the concept of fairness.

3.3 Metrics and Statistical Tests

In this study, we report nDCG@10 for relevance and AWRF@10 for fairness. Instead of evaluating the full ranking, which is what the Fair Ranking track originally proposed, we use a shallow rank cutoff to better evaluate the reranking models.

When testing for differences in either nDCG or AWRF, we conduct paired t-tests at the query level, where the null hypothesis is that the methods are identical ($H_0: x_1 = x_2$). Alternatively, when testing for equivalence, we use the paired Two One-Sided Tests (TOST) [23], where one test checks whether the difference is greater than the lower bound ($-\delta$) and the other checks whether it is less than the upper bound ($+\delta$). If both tests reject their respective null hypotheses at a certain significance level (typically 0.05), we conclude the methods are statistically equivalent within the tolerance $\delta$ ($H_0: -\delta \le x_1 - x_2 \le \delta$).
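The paired TOST equivalence procedure described above, together with a Holm-Bonferroni step-down correction, can be sketched as follows. This is a minimal illustration; the function names are mine and the authors' exact test configuration may differ:

```python
import numpy as np
from scipy import stats

def paired_tost(x1, x2, delta=0.05, alpha=0.05):
    """Paired two one-sided tests: rejecting both one-sided nulls
    supports |mean(x1 - x2)| < delta, i.e. equivalence within delta."""
    d = np.asarray(x1, float) - np.asarray(x2, float)
    n = len(d)
    se = d.std(ddof=1) / np.sqrt(n)
    # H0a: mean difference <= -delta (reject when t is large and positive)
    p_lower = stats.t.sf((d.mean() + delta) / se, df=n - 1)
    # H0b: mean difference >= +delta (reject when t is large and negative)
    p_upper = stats.t.cdf((d.mean() - delta) / se, df=n - 1)
    p = max(p_lower, p_upper)  # overall TOST p-value
    return p, p < alpha

def holm_bonferroni(pvals, alpha=0.05):
    """Holm's step-down correction: returns a reject flag per p-value."""
    pvals = np.asarray(pvals, float)
    order = np.argsort(pvals)
    reject = np.zeros(len(pvals), dtype=bool)
    for i, idx in enumerate(order):
        # Compare the i-th smallest p-value against alpha / (m - i).
        if pvals[idx] <= alpha / (len(pvals) - i):
            reject[idx] = True
        else:
            break
    return reject
```

Two nearly identical score vectors pass the equivalence test (small overall p), while vectors whose means differ by more than delta do not.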
We set δ = 0.05 for all TOSTs conducted in this study. We apply the Holm-Bonferroni method [9] for multiple testing correction when rejecting multiple statistical tests in one claim. This applies to both the paired t-tests and the paired TOSTs.

4 Results and Analysis

4.1 Relevance and Fairness Comparison between Rerankers

To answer RQ1, “how does query formulation impact fairness and relevance in both reasoning and non-reasoning rerankers?”, we compare performance using keywords versus rewritten queries. We measure relevance using nDCG@10 and fairness using AWRF@10.⁴

Since LLM-based rerankers are designed to process natural language queries, we first explore the performance differences between using the keyword and rewritten queries. Based on the left two sections in Table 2, all rerankers achieve substantially higher nDCG@10 when searching with rewritten queries using BM25 compared to keywords, indicating that our zero-shot rewriting process does not introduce significant query drift. Almost all rerankers are significantly better in nDCG@10 than the initial BM25 ranking, regardless of the queries, except for RankLLaMA (paired t-test with p < 0.05 and multiple testing correction), with strong relative improvements over the initial ranking. Interestingly, reranking does not seem to affect the relative group exposure at the top of the ranking, even when we rerank the top 500 documents from the initial ranking. To further confirm this hypothesis, we conducted TOSTs and verified that all rerankers demonstrate equivalent AWRF@10 to the initial retrieval (p < 0.05 with multiple testing correction).

Query formulation, when effective, clearly leads to improved relevance as measured by nDCG@10 for both BM25 and subsequent reranking. However, it does not change fairness, resulting in stationary AWRF@10 scores. We observe these trends for both reasoning and non-reasoning rerankers.
Summarized in the right parts of Table 2, the Qwen3 and fusion initial rankings both have a lower nDCG@10 than BM25 with the rewritten queries (0.540).

Table 2. nDCG@10 and AWRF@10 for BM25, Qwen3-Embedding-8B, and various rerankers across keyword, rewritten, and fusion queries. The (R) and (N) prefixes indicate reasoning and non-reasoning models.

| Model | BM25 (keywords) nDCG | AWRF | BM25 (rewritten) nDCG | AWRF | Qwen3-8B nDCG | AWRF | Fusion nDCG | AWRF |
|---|---|---|---|---|---|---|---|---|
| Initial Retrieval | 0.247 | 0.338 | 0.540 | 0.336 | 0.311 | 0.347 | 0.416 | 0.344 |
| (N) RankZephyr-7B | 0.567 | 0.333 | 0.760 | 0.338 | 0.726 | 0.338 | 0.765 | 0.336 |
| (R) ReasonRank-7B | 0.455 | 0.337 | 0.712 | 0.338 | 0.680 | 0.341 | 0.747 | 0.339 |
| (N) RankLLaMA-7B | 0.271 | 0.333 | 0.526 | 0.337 | 0.656 | 0.340 | 0.588 | 0.337 |
| (N) MonoT5-0.3B | 0.391 | 0.338 | 0.721 | 0.343 | 0.717 | 0.341 | 0.713 | 0.342 |
| (R) Qwen3-Reranker-8B | 0.373 | 0.338 | 0.617 | 0.337 | 0.503 | 0.342 | 0.605 | 0.343 |
| (R) Rank1-7B | 0.503 | 0.333 | 0.721 | 0.344 | 0.693 | 0.336 | 0.763 | 0.343 |

⁴ We ran the same evaluation @20 and reached the same conclusion.

Again, all rerankers demonstrate equivalent AWRF@10 to their respective initial retrieval results (TOST with p < 0.05 and multiple testing correction). Among the pointwise rerankers, MonoT5, while being the smallest model, achieves the strongest nDCG@10 (0.717) when reranking Qwen3’s initial ranking, but performs slightly worse than Rank1 on reranking fusion results (0.713 vs. 0.763).
Pointwise rerankers all show lower AWRF@10 after reranking compared to the initial Qwen3 and fusion retrieval results. While the signals are relatively weak, this suggests these pointwise rerankers only consider relevance in the scores they produce, which aligns with their training objectives.

Across all three initial rankings using rewritten queries (BM25, Qwen3, and fusion), listwise rerankers consistently provide stronger effectiveness in nDCG@10, with the non-reasoning RankZephyr being slightly better than its reasoning counterpart ReasonRank. Although still lower than the initial retrieval AWRF@10, ReasonRank shows numerically slightly better AWRF@10 when reranking Qwen3 (0.341) and fusion (0.339) compared to RankZephyr (0.338 and 0.336), indicating that the reasoning model may begin considering document diversity when documents are equally relevant. However, while capable of comparing across documents and considering aspects beyond just relevance signals, these two listwise rerankers do not effectively consider other signals, even when documents are equally relevant, resulting in lower AWRF@10 than the initial ranking.

To answer RQ2 (“Do reasoning-based rerankers exhibit different fairness characteristics compared to non-reasoning rerankers?”), we examine fairness patterns in the main results, then isolate relevance and fairness factors using an oracle ranking experiment.

The difference in AWRF@10 between initial retrieval and reranked results can be attributed to a combination of two factors: (1) inability to rank relevant documents to the top, and (2) inability to identify alternative relevant documents with different demographics. Comparing reasoning versus non-reasoning rerankers across all three initial rankings (BM25, Qwen3, fusion) in Table 2, we observe minimal differences in AWRF@10.
For example, when reranking fusion results, ReasonRank shows AWRF@10 = 0.339 versus RankZephyr’s 0.336, numerically similar values, and both lower than the initial retrieval AWRF@10 of 0.344.

To isolate whether rerankers can consider fairness when relevance is controlled, we construct an oracle ranking of 500 documents that achieves a full-ranking nDCG of 0.9 (with nDCG@10 of 0.886) and rerank with all tested rerankers. Summarized in Table 3, all rerankers achieve near-perfect nDCG@10 except for RankLLaMA. Since there are generally more than 10 relevant documents per query, these results demonstrate that rerankers successfully place almost only relevant documents in the top 10.

With near-perfect relevance achieved by all rerankers (nDCG@10 ≥ 0.923 except RankLLaMA), we can isolate the fairness behavior. Despite all rerankers demonstrating lower or equivalent AWRF@10 compared to the oracle initial retrieval (0.352), we observe weak trends in different directions. Pointwise reasoning models show slightly higher AWRF@10 (Qwen3-Reranker: 0.350, Rank1: 0.348) compared to non-reasoning pointwise models (MonoT5: 0.345, RankLLaMA: 0.345). This implies that when documents are all equally relevant, the pointwise scores produced by reasoning models may have started to convey information beyond pure relevance.

However, for the two listwise rerankers, the trend reverses: RankZephyr shows slightly higher AWRF@10 (0.353) compared to ReasonRank (0.350). We hypothesize that this may be due to the ranking instruction tuning that explicitly pushes the model to consider only relevance signals. Since reasoning listwise models are more capable of following instructions, the training instruction may be amplified in the final reranking results. There is no strong evidence that reasoning-based rerankers exhibit different fairness characteristics compared to non-reasoning rerankers.
While we observe weak trends in the oracle experiment, with pointwise reasoning models showing slightly better AWRF@10 but listwise reasoning models showing slightly worse AWRF@10, these patterns require further investigation to confirm.

Table 3. Oracle results for all rerankers and retrieval baselines.

| Model | nDCG | AWRF |
|---|---|---|
| Oracle Initial Retrieval | 0.886 | 0.352 |
| (N) RankZephyr-7B | 0.972 | 0.353 |
| (R) ReasonRank-7B | 1.000 | 0.350 |
| (N) RankLLaMA-7B | 0.825 | 0.345 |
| (N) MonoT5-0.3B | 1.000 | 0.345 |
| (R) Qwen3-Reranker | 0.923 | 0.350 |
| (R) Rank1-7B | 0.996 | 0.348 |

This equivalence across architectures can be explained by several factors. First, none of the rerankers were explicitly trained to optimize for fairness; they were trained solely on relevance judgments from datasets like MS MARCO. Without fairness-aware training objectives or prompts, rerankers lack incentive to consider demographic attributes. Second, demographic information (especially geography) is often not explicitly mentioned in document text, making it difficult for rerankers to condition on these attributes even implicitly. Third, rerankers operate on fixed candidate pools from initial retrieval, so if the initial ranking has limited demographic diversity in the top 500 documents, reranking cannot introduce diversity not present in the pool. The reasoning process in models like Rank1 focuses on query-document relevance matching rather than demographic considerations, explaining why reasoning provides no fairness advantage over non-reasoning approaches under current training paradigms.

4.2 Some Fairness Attributes are Disproportionately Represented

Table 4 provides attribute-level breakdowns across the eight sensitive attributes on M1, the official metric of the 2022 Fair Ranking track. The nDCG@10 values are close to 1.0 since we used the Oracle rankings, so M1@10 is essentially AWRF@10.
Despite only showing the results of reranking the oracle initial rankings, we observe very similar trends when reranking other initial rankings. However, since reranking the oracle initial rankings provides more candidate relevant documents for the rerankers to choose from, it provides cleaner experimental results to analyze.

Across all retrieval settings, certain attributes consistently achieved higher fairness scores than others. Languages, gender, and age attributes typically scored highest across all rerankers and initial rankings. The alphabetical bias (the attribute of biasing toward alphabetical order, since it is likely the default processing order of documents) is particularly interesting, as most recent LLM-based rerankers try to combat positional biases [12]. While the oracle initial retrieval has roughly the same AWRF on gender and alphabets, all rerankers exhibit substantially higher AWRF on alphabets than on gender.

The attributes that are systematically lower in AWRF, such as geographical locations, are ones that are less likely to be represented in the text, which is all the rerankers consider. If these attributes are crucial in an application but are not incorporated into the retrieval pipeline, it is unlikely that the models will provide a fair ranking on them. This indicates a fundamental limitation: certain demographic groups are underrepresented even among highly relevant documents, since this information is generally harder to capture in the text.

Table 4. Overall and attribute M1 of each reranker. Since AWRF is a distributional distance measurement, overall M1 is not an average over each attribute.

| Model | M1 | langs | gender | alpha | age | pop | occ | src-geo | sub-geo |
|---|---|---|---|---|---|---|---|---|---|
| Oracle Retrieval | 0.312 | 0.873 | 0.868 | 0.867 | 0.864 | 0.866 | 0.854 | 0.797 | 0.761 |
| (N) RankZephyr-7B | 0.343 | 0.921 | 0.901 | 0.926 | 0.906 | 0.897 | 0.893 | 0.851 | 0.801 |
| (R) ReasonRank-7B | 0.350 | 0.941 | 0.926 | 0.937 | 0.930 | 0.913 | 0.916 | 0.864 | 0.820 |
| (N) RankLLaMA-7B | 0.287 | 0.816 | 0.808 | 0.814 | 0.813 | 0.804 | 0.797 | 0.758 | 0.722 |
| (N) MonoT5-0.3B | 0.345 | 0.959 | 0.901 | 0.952 | 0.944 | 0.936 | 0.891 | 0.890 | 0.827 |
| (R) Qwen3-Reranker | 0.323 | 0.905 | 0.891 | 0.900 | 0.898 | 0.896 | 0.884 | 0.836 | 0.800 |
| (R) Rank1-7B | 0.346 | 0.952 | 0.936 | 0.968 | 0.937 | 0.915 | 0.924 | 0.887 | 0.829 |

Comparing different rerankers, reasoning rerankers are generally better than their non-reasoning counterparts, with the exception of MonoT5, which is comparable with Rank1.

So, to answer RQ3 (how does fairness vary across different demographic attributes for reasoning vs. non-reasoning rerankers): there are strong differences between the attributes for all rerankers, likely due to the availability of the information in text. However, there are weaker differences between reasoning and non-reasoning rerankers, with reasoning ones being slightly more fair.

5 Conclusion and Future Work

This work presents the first systematic comparison of fairness between reasoning and non-reasoning rerankers across multiple retrieval settings and demographic attributes. Using the TREC 2022 Fair Ranking Track, we evaluated six reranking models across four retrieval configurations and one oracle setting: BM25 with keyword queries, BM25 with rewritten queries, Qwen3-8B dense retrieval, BM25+Qwen3-8B fusion, and an oracle-produced ranking targeting nDCG = 0.9. Our primary finding is that reasoning capabilities, as currently implemented in LLM-based rerankers, neither improve nor harm fairness compared to non-reasoning approaches. Attribute-level analysis revealed interesting differences between demographics.
Geographic attributes, particularly subject geography, consistently showed the lowest fairness scores. Even in the oracle setting with near-perfect relevance, subject geography peaked at 0.829, compared to 0.968 for the alphabetic attribute and 0.959 for languages. These patterns held for both reasoning and non-reasoning rerankers. Improving search fairness therefore requires interventions beyond reranking, such as further diversifying document collections, auditing representational gaps, and designing retrieval strategies that actively surface underrepresented perspectives. Natural language query understanding deserves continued investment, as query formulation influences both relevance and fairness more than the reranking architecture.

Limitations include our focus on exposure-based fairness (AWRF), which may not capture other fairness notions such as calibration or intersectional fairness. We do not examine the content of the reasoning justifications, which could reveal subtle biases not captured by ranking metrics. Our evaluation is confined to English-language documents and specific demographic attributes. It also uses AWRF with TREC 2022's specific target distribution formulation, which equally weights empirical relevant-document demographics and world population statistics. The stability of AWRF across rerankers might not hold under alternative fairness definitions that place different weights on these components or that enforce stricter demographic parity constraints.

Future work should extend fairness analysis to other contexts, examine whether reasoning can be steered toward more equitable outcomes, and develop methods for improving collection diversity alongside algorithmic fairness. Reasoning rerankers offer strong relevance improvements without fairness degradation, but they cannot overcome representation gaps in the information ecosystems they operate within.
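To make the exposure-based metric concrete, the following is a minimal sketch of an AWRF-style computation under stated assumptions: a geometric attention decay over ranks and a Jensen–Shannon distance to the target, in the spirit of Sapiezynski et al. [22], with the target built as the equal-weight mix of empirical and world-population demographics described above. The exact TREC 2022 implementation may differ in its attention model, smoothing, and normalization; function names here are illustrative, not the track's API.

```python
import math

def trec2022_target(empirical, world_pop, weight=0.5):
    # Equal-weight convex combination (weight = 0.5, per the track's
    # formulation as described in the text) of empirical relevant-document
    # demographics and world population shares.
    groups = set(empirical) | set(world_pop)
    return {g: weight * empirical.get(g, 0.0)
               + (1.0 - weight) * world_pop.get(g, 0.0) for g in groups}

def exposure(group_labels, gamma=0.5):
    # Position-weighted attention per group: rank i receives weight gamma**i
    # (an assumed geometric browsing model), normalized into a distribution.
    weights = [gamma ** i for i in range(len(group_labels))]
    total = sum(weights)
    dist = {}
    for w, g in zip(weights, group_labels):
        dist[g] = dist.get(g, 0.0) + w / total
    return dist

def js_distance(p, q):
    # Jensen-Shannon distance (base-2) between two group distributions.
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return math.sqrt(0.5 * kl(p) + 0.5 * kl(q))

def awrf(group_labels, target):
    # Higher is fairer (one minus the distance), matching the paper's tables.
    return 1.0 - js_distance(exposure(group_labels), target)
```

Under this sketch, a ranking that alternates two groups scores higher than one dominated by a single group against a uniform target, e.g. `awrf(["A", "B"] * 5, {"A": 0.5, "B": 0.5})` exceeds `awrf(["A"] * 10, {"A": 0.5, "B": 0.5})`, illustrating why relevance-only reranking leaves the metric governed by the demographics of the candidate pool.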
5.1 Theory of Change

When search systems underrepresent certain demographic groups, they reinforce information inequalities. As LLM-based reasoning rerankers are increasingly deployed in production systems, understanding how they impact fairness is important to prevent amplifying demographic biases at scale. Our work compares fairness outcomes between reasoning and non-reasoning rerankers across multiple retrieval settings and demographic attributes.

For this work to achieve its desired impact, several preconditions must hold. Fairness metrics like AWRF must become standard in retrieval evaluation pipelines, since reranking for relevance alone cannot solve fairness: our oracle experiments show that even near-perfect relevance prediction does not achieve significantly higher fairness. Meaningful progress requires addressing upstream issues such as diversifying content sources and implementing retrieval strategies that actively seek wider coverage of aspects and perspectives.

Potential negative externalities include false complacency. Our findings might be misinterpreted as evidence that modern rerankers are “fair enough,” discouraging ongoing auditing, when “not making things worse” differs from “making things fair.” Overreliance on AWRF as the sole fairness metric risks overlooking other aspects, such as quality of representation or intersectional fairness.

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References

1. Burke, R.: Multisided fairness for recommendation (2017), URL https://arxiv.org/abs/1707.00093
2.
DeepSeek-AI: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning (2025), URL
https://arxiv.org/abs/2501.12948
3. Diaz, F., Mitra, B., Ekstrand, M.D., Biega, A.J., Carterette, B.: Evaluating stochastic rankings with expected exposure. In: Proceedings of the 29th ACM International Conference on Information and Knowledge Management, p. 275–284, CIKM '20, ACM (Oct 2020), https://doi.org/10.1145/3340531.3411962
4. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.: Fairness through awareness (2011), URL https://arxiv.org/abs/1104.3913
5. Ekstrand, M.D., McDonald, G., Raj, A., Johnson, I.: Overview of the TREC 2022 fair ranking track (2023), URL https://arxiv.org/abs/2302.05558
6. Fabris, A., Baranowska, N., Dennis, M.J., Graus, D., Hacker, P., Saldivar, J., Zuiderveen Borgesius, F., Biega, A.J.: Fairness and bias in algorithmic hiring: A multidisciplinary survey. ACM Trans. Intell. Syst. Technol. 16(1) (Jan 2025), ISSN 2157-6904, https://doi.org/10.1145/3696457
7. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., Wang, H.: Retrieval-augmented generation for large language models: A survey (2024), URL https://arxiv.org/abs/2312.10997
8. Grootendorst, M.: KeyBERT: Minimal keyword extraction with BERT (2020), https://doi.org/10.5281/zenodo.4461265
9. Holm, S.: A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2), 65–70 (1979), ISSN 0303-6898, URL http://www.jstor.org/stable/4615733
10. Lawrie, D., MacAvaney, S., Mayfield, J., McNamee, P., Oard, D.W., Soldaini, L., Yang, E.: Overview of the TREC 2023 NeuCLIR track (2024), URL https://arxiv.org/abs/2404.08071
11. Lee, M.H.J., Lai, C.K.: Implicit bias-like patterns in reasoning models (2025), URL https://arxiv.org/abs/2503.11572
12.
Li, Z., Wang, C., Ma, P., Wu, D., Wang, S., Gao, C., Liu, Y.: Split and merge: Aligning position biases in LLM-based evaluators. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, p. 11084–11108 (2024)
13. Liu, W., Ma, X., Sun, W., Zhu, Y., Li, Y., Yin, D., Dou, Z.: ReasonRank: Empowering passage ranking with strong reasoning ability (2025), URL https://arxiv.org/abs/2508.07050
14. Ma, X., Wang, L., Yang, N., Wei, F., Lin, J.: Fine-tuning LLaMA for multi-stage text retrieval (2023), URL https://arxiv.org/abs/2310.08319
15. Menéndez, M.L., Pardo, J.A., Pardo, L., Pardo, M.C.: The Jensen–Shannon divergence. Journal of the Franklin Institute 334(2), 307–318 (1997), ISSN 0016-0032, https://doi.org/10.1016/S0016-0032(96)00063-4
16. Nogueira, R., Jiang, Z., Pradeep, R., Lin, J.: Document ranking with a pretrained sequence-to-sequence model. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2020, p. 708–718, Association for Computational Linguistics, Online (Nov 2020), https://doi.org/10.18653/v1/2020.findings-emnlp.63
17.
OpenAI: GPT-4o system card (2024), URL https://arxiv.org/abs/2410.21276
18. Pradeep, R., Sharifymoghaddam, S., Lin, J.: RankZephyr: Effective and robust zero-shot listwise reranking is a breeze! (2023), URL https://arxiv.org/abs/2312.02724
19. Raj, A., Ekstrand, M.D.: Comparing fair ranking metrics (2022), URL https://arxiv.org/abs/2009.01311
20. Redi, M., Gerlach, M., Johnson, I., Morgan, J., Zia, L.: A taxonomy of knowledge gaps for Wikimedia projects (second draft) (2021), URL https://arxiv.org/abs/2008.12314
21. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (Apr 2009), ISSN 1554-0669, https://doi.org/10.1561/1500000019
22.
Sapiezynski, P., Zeng, W., Robertson, R.E., Mislove, A., Wilson, C.: Quantifying the impact of user attention on fair group representation in ranked lists. In: Companion Proceedings of The 2019 World Wide Web Conference, p. 553–562, WWW '19, Association for Computing Machinery, New York, NY, USA (2019), ISBN 9781450366755, https://doi.org/10.1145/3308560.3317595
23. Schuirmann, D.J.: A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics 15(6), 657–680 (1987)
24. Singh, A., Joachims, T.: Fairness of exposure in rankings. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 2219–2228, KDD '18, ACM (Jul 2018), https://doi.org/10.1145/3219819.3220088
25. Su, H., Yen, H., Xia, M., Shi, W., Muennighoff, N., Wang, H.y., Liu, H., Shi, Q., Siegel, Z.S., Tang, M., Sun, R., Yoon, J., Arik, S.O., Chen, D., Yu, T.: BRIGHT: A realistic and challenging benchmark for reasoning-intensive retrieval (2025), URL https://arxiv.org/abs/2407.12883
26. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., Gurevych, I.: BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models (2021), URL https://arxiv.org/abs/2104.08663
27. Wang, Y., Wu, X., Wu, H.T., Tao, Z., Fang, Y.: Do large language models rank fairly? An empirical study on the fairness of LLMs as rankers (2024), URL https://arxiv.org/abs/2404.03192
28. Weller, O., Ricci, K., Yang, E., Yates, A., Lawrie, D., Durme, B.V.: Rank1: Test-time compute for reranking in information retrieval (2025), URL https://arxiv.org/abs/2502.18418
29.
Yang, E., Yates, A., Ricci, K., Weller, O., Chari, V., Durme, B.V., Lawrie, D.: Rank-K: Test-time reasoning for listwise reranking (2025), URL https://arxiv.org/abs/2505.14432
30. Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., Baeza-Yates, R.: FA*IR: A fair top-k ranking algorithm. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, p. 1569–1578, CIKM '17, ACM (Nov 2017), https://doi.org/10.1145/3132847.3132938
31. Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., Huang, F., Zhou, J.: Qwen3 embedding: Advancing text embedding and reranking through foundation models (2025), URL https://arxiv.org/abs/2506.05176
32. Zhuang, S., Ma, X., Koopman, B., Lin, J., Zuccon, G.: Rank-R1: Enhancing reasoning in LLM-based document rerankers via reinforcement learning (2025), URL https://arxiv.org/abs/2503.06034