Paper deep dive
Public Profile Matters: A Scalable Integrated Approach to Recommend Citations in the Wild
Karan Goyal, Dikshant Kukreja, Vikram Goyal, Mukesh Mohania
Abstract
Proper citation of relevant literature is essential for contextualising and validating scientific contributions. While current citation recommendation systems leverage local and global textual information, they often overlook the nuances of the human citation behaviour. Recent methods that incorporate such patterns improve performance but incur high computational costs and introduce systematic biases into downstream rerankers. To address this, we propose Profiler, a lightweight, non-learnable module that captures human citation patterns efficiently and without bias, significantly enhancing candidate retrieval. Furthermore, we identify a critical limitation in current evaluation protocol: the systems are assessed in a transductive setting, which fails to reflect real-world scenarios. We introduce a rigorous Inductive evaluation setting that enforces strict temporal constraints, simulating the recommendation of citations for newly authored papers in the wild. Finally, we present DAVINCI, a novel reranking model that integrates profiler-derived confidence priors with semantic information via an adaptive vector-gating mechanism. Our system achieves new state-of-the-art results across multiple benchmark datasets, demonstrating superior efficiency and generalisability.
Tags
Links
- Source: https://arxiv.org/abs/2603.17361v1
- Canonical: https://arxiv.org/abs/2603.17361v1
Full Text
Public Profile Matters: A Scalable Integrated Approach to Recommend Citations in the Wild

Karan Goyal (IIIT Delhi, India, karang@iiitd.ac.in), Dikshant Kukreja** (IIIT Delhi, India, dikshant22176@iiitd.ac.in), Vikram Goyal (IIIT Delhi, India, vikram@iiitd.ac.in), Mukesh Mohania (IIIT Delhi, India, mukesh@iiitd.ac.in)

Abstract

Proper citation of relevant literature is essential for contextualising and validating scientific contributions. While current citation recommendation systems leverage local and global textual information, they often overlook the nuances of the human citation behaviour. Recent methods that incorporate such patterns improve performance but incur high computational costs and introduce systematic biases into downstream rerankers. To address this, we propose Profiler, a lightweight, non-learnable module that captures human citation patterns efficiently and without bias, significantly enhancing candidate retrieval. Furthermore, we identify a critical limitation in current evaluation protocol: the systems are assessed in a transductive setting, which fails to reflect real-world scenarios. We introduce a rigorous Inductive evaluation setting that enforces strict temporal constraints, simulating the recommendation of citations for newly authored papers in the wild. Finally, we present DAVINCI, a novel reranking model that integrates profiler-derived confidence priors with semantic information via an adaptive vector-gating mechanism. Our system achieves new state-of-the-art results across multiple benchmark datasets, demonstrating superior efficiency and generalisability.

1 Introduction

The rapid expansion of scientific research has led to an exponential surge in published literature (Drozdz and Ladomery, 2024; Rousseau et al., 2023). This information deluge presents a significant bottleneck for researchers attempting to identify and integrate relevant prior work (Datta et al., 2024; Bhagavatula et al., 2018).
Consequently, there is a critical need for automated systems that can efficiently streamline the citation process (Goyal et al., 2024; Gu et al., 2022).

** Implemented Profiler and ran the open-source rerankers.

Citation recommendation methodologies are generally categorised into two paradigms: "global" (Ni et al., 2024; Ali et al., 2021; Xie et al., 2021) and "local" (Jeong et al., 2020; Dai et al., 2019; Ebesu and Fang, 2017; Huang et al., 2015; Livne et al., 2014; He et al., 2010). While global recommendation suggests papers based on the overall theme of a document, local citation recommendation (LCR) operates at a fine-grained level and is the focus of this research work. LCR targets specific "citation contexts" or excerpts, aiming to suggest references that align semantically and conceptually with the immediate narrative of a passage. State-of-the-art (SOTA) LCR systems typically leverage metadata like titles and abstracts alongside citation contexts. For instance, SymTax (Goyal et al., 2024) utilises a three-stage architecture involving a prefetcher, an "enricher" to capture symbiotic neighbourhood relationships, and a reranker. However, this approach faces three major challenges. First, the enricher mimics human citation behaviour, specifically the tendency to cite from a narrow pool of seminal works, which, while effective, introduces and perpetuates inherent "confirmation bias" in the citation ecosystem. Second, the three-stage candidate retrieval process imposes significant computational overhead. Third, its reliance on paper-specific taxonomy limits generalisability, as such metadata is often unavailable in benchmark datasets. More recently, Çelik and Tekir (2025) proposed CiteBART to generate parenthetical author-year strings directly for an input citation context.
We identify two critical flaws in this setup: (i) the generative nature leads to hallucinations of non-existent citations, and (ii) the framework is semantically decoupled from the research content. By focusing on "author-year" strings, the model treats research as a function of primary authors' names (e.g., "Celik" or "Goyal") rather than the substantive scientific content, which is fundamentally independent of such identifiers.

[arXiv:2603.17361v1 [cs.IR], 18 Mar 2026]

Moreover, we shed light on the current training and evaluation practice of LCR systems, which operate in a setting that deviates from real-world scenarios. To address these limitations, we make the following contributions:

• We introduce the Profiler, a lightweight, non-learnable module for candidate retrieval. It is remarkably efficient and free from confirmation bias, yet it outperforms the sequential combination of prefetcher and enricher.
• We demonstrate the importance of a paper's "public profile", i.e., how the research ecosystem perceives a paper, as a remarkably vital signal for recommendation.
• We develop the DAVINCI reranker, which discriminatively integrates confidence priors with textual semantics via an adaptive vector-gating mechanism. Unlike the previous SOTA, it is architecturally generalisable across diverse datasets without requiring special metadata like taxonomies.
• We establish a new state-of-the-art, demonstrating that DAVINCI surpasses both specialised LCR systems and massive-scale open-source rerankers adapted for this task.
• Finally, we introduce and benchmark LCR in an inductive setting, providing a more realistic evaluation framework for citations "in the wild."

2 Related Work

Early investigations, such as those by He et al. (2010); Livne et al. (2014); Huang et al. (2015); Ebesu and Fang (2017); Dai et al.
(2019), formally introduced local citation recommendation, utilising approaches ranging from TF-IDF-based vector similarity to bidirectional LSTMs for modelling contextual information. In an effort to integrate both contextual and graph-based signals, Jeong et al. (2020) proposed the BERT-GCN model. This model leverages BERT (Kenton and Toutanova, 2019) to generate contextualised embeddings for citation contexts, capturing semantic nuances. Simultaneously, it employs a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) to extract structural information from the citation network and determine the relevance between a context and potential citations. However, as noted by Gu et al. (2022), the computational intensity inherent in GCNs posed a significant practical challenge. Consequently, the BERT-GCN model's evaluation was constrained to small datasets with only a few thousand citation contexts. This limitation emphasises a critical scalability bottleneck for GNN-based recommendation models when applied to large-scale datasets, highlighting the need for more computationally efficient techniques.

Medić and Šnajder (2020) explored the integration of global document information to enhance citation recommendation. However, as reported in Gu et al. (2022) and Goyal et al. (2024), it creates an artificial setup which does not exist in reality. Ostendorff et al. (2022) suggested a graph-centric approach (SciNCL), utilising neighbourhood contrastive learning across the complete citation graph to generate informative citation embeddings. These embeddings facilitate efficient retrieval of top recommendations using k-nearest-neighbour indexing. Recently, Gu et al. (2022) introduced an efficient two-stage recommendation architecture (HAtten), which strategically separates the recommendation process into a rapid prefetching stage and a more refined reranking stage, optimising for both speed and accuracy. Building upon HAtten, Goyal et al.
(2024) proposed a three-stage recommendation architecture (SymTax) composed of a prefetcher, an enricher and a reranker, establishing the state of the art in local citation recommendation. Very recently, Çelik and Tekir (2025) performed continual pre-training of BART-base to generate the correct parenthetical author-year citation for a given context. Crucially, this generative approach relies heavily on author-year surface forms rather than the underlying research contributions. This creates a semantic bottleneck where the model prioritises bibliographic identifiers over the actual scientific content, which is inherently independent of the authors' identities.

3 Proposed Work

Problem Formulation. We formulate the task of local citation recommendation as a two-stage retrieval and reranking problem, designed to handle the immense scale of modern scholarly corpora. Given a query instance q = (S_q, M_q), comprising a snippet of citation context S_q and the source document's meta information M_q (characterised by its title T_q and abstract A_q), and a large corpus of scientific documents C = {D_i}, the process is as follows. First, in the retrieval stage, our novel Profiler module efficiently retrieves an initial candidate set C_q ⊂ C, where |C_q| ≪ |C|. For each candidate document c_i ∈ C_q, Profiler also yields a confidence score, s_i, which serves as an initial estimate of its relevance. Second, in the reranking stage, our proposed DAVINCI model ingests this candidate set and their associated confidence scores. It computes a final, fine-grained relevance score, f_DAVINCI(q, c_i, s_i), by fusing a deep semantic analysis of the content with the discriminative priors obtained by refining the confidence signal from the Profiler. The final output is a ranked list L_q of the documents in C_q, sorted in descending order of their DAVINCI scores, representing the most suitable citations for a given context.

Table 1: The impact of our rigorous inductive setting. Enforcing temporal consistency corrects the inflation in corpus and evaluation sets seen in standard benchmarks, resulting in a markedly smaller and more realistic set of documents for training and inference. 'FTPR': FullTextPeerRead.

| Dataset | Train (Trans.) | Val (Trans.) | Test (Trans.) | #Papers (Trans.) | Train (Ind.) | Val (Ind.) | Test (Ind.) | #Corpus (Ind.) |
|---|---|---|---|---|---|---|---|---|
| ACL-200 | 30,390 | 9,381 | 9,585 | 19,776 | 30,390 | 8,512 | 7,072 | 7,108 |
| FTPR | 9,363 | 492 | 6,814 | 4,837 | 9,363 | 472 | 5,918 | 3,313 |
| RefSeer | 3,521,582 | 124,911 | 126,593 | 624,957 | 3,521,582 | 117,724 | 105,411 | 580,059 |
| arXiv | 2,988,030 | 112,779 | 104,401 | 1,661,201 | 2,988,030 | 103,125 | 95,247 | 700,403 |
| ArSyTa | 8,030,837 | 124,188 | 124,189 | 474,341 | 8,030,837 | 123,515 | 122,989 | 412,127 |

3.1 Inductive Setting: Rethinking the Evaluation Protocol

A central contribution of our work is to address a fundamental yet often overlooked limitation in the standard evaluation protocol for citation recommendation. Traditionally, models are evaluated in a transductive setting. In this setup, the corpus of candidate documents is often constructed from the union of the training, validation and test sets, together with the unparsable documents. While this does not lead to direct data leakage (i.e., using test labels for training), it creates an artificial evaluation landscape. Specifically, the ground-truth citation for a given test query may itself be another document within the test set. This means the system is evaluated on its ability to find connections within a static collection where the query documents themselves are pre-indexed and searchable, a condition that never holds in a real-world application. To faithfully address this shortcoming, we define and adopt a rigorous inductive evaluation setting.
The core principle of the inductive setting is to enforce a strict temporal separation between the evaluation query and the candidate corpus, mirroring the natural arrow of time in research. Formally, let D_eval be an evaluation set (either the validation set, D_val, or the test set, D_test), and let C be the candidate corpus available for recommendation. The inductive setting imposes two critical constraints:

1. Disjoint Sets: The set of evaluation documents and the candidate corpus must be strictly disjoint, as defined by:

   D_eval ∩ C = ∅   (1)

2. Temporal Consistency: For any query document D_q ∈ D_eval, the candidate corpus C must only contain documents published strictly before D_q, formalised as:

   ∀ D_q ∈ D_eval, ∀ D_i ∈ C : date(D_i) < date(D_q)   (2)

This setup ensures that, at evaluation time, a model is tasked with recommending citations for a "newly authored" paper (D_q) using only the body of "existing" literature (C). By adopting this inductive protocol, we eliminate any artificial advantage gained from a pre-known test set and obtain a more realistic and reliable assessment of a model's true generalisation capabilities. All experiments and benchmarks presented in this paper are conducted under this stringent inductive setting to ensure a fair and meaningful comparison. We show the statistics for the benchmark datasets in Table 1.

3.2 Profiler: A Non-Learnable First-Stage Retrieval

The first stage of our system is the Profiler, a novel retrieval module designed to overcome the computational bottlenecks inherent in current state-of-the-art citation recommendation systems. Its design philosophy is rooted in decoupling the expensive process of representation enrichment of documents from the online query task. A key technical merit of Profiler is that it is an entirely non-learnable module. It operates as a principled, static transformation of the citation network, making it exceptionally fast and scalable.
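The inductive constraints of Section 3.1 (Eqs. 1 and 2) amount to a simple filter over the corpus. A minimal sketch, assuming each document carries a publication date; the document names and dates below are illustrative, not taken from the paper's datasets:

```python
from datetime import date

def inductive_corpus(eval_doc_id, eval_date, corpus):
    """Candidate corpus for one evaluation document under the inductive
    constraints: disjointness (Eq. 1) and strict temporal precedence (Eq. 2)."""
    return [
        doc_id
        for doc_id, pub_date in corpus.items()
        if doc_id != eval_doc_id and pub_date < eval_date
    ]

corpus = {
    "older_paper": date(2019, 5, 1),
    "recent_paper": date(2024, 1, 10),
    "same_day_paper": date(2024, 6, 1),  # not strictly earlier: excluded
}
print(inductive_corpus("query_paper", date(2024, 6, 1), corpus))
# → ['older_paper', 'recent_paper']
```

Note the strict inequality: a paper published the same day as the query document is excluded, mirroring Eq. 2.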
The name 'Profiler' reflects its core function: to compute a rich public profile for every document. We posit that a paper's relevance is a function of both its intrinsic content and its perceived identity within the scholarly network, i.e., an identity shaped by its citing papers and the contexts of those citations.

Profiled Document Representations: A Static Enrichment Process. The Profiler's first task is a one-off, offline pre-processing step: transforming the entire corpus into a profiled citation network. For every document D_i ∈ C, we begin by initialising its base vector representation, v_i ∈ R^{d_ENC1}, using a small pre-trained language model encoder, ENC_1(·). We use specter2_base (Singh et al., 2022) as the encoder due to its better performance observed with citation networks (Goyal et al., 2024). To construct the profile of D_i, we augment this base representation with signals from its inward ego network, N_in(D_i), which is the set of documents that cite D_i. For each citing document D_j ∈ N_in(D_i), we extract two distinct signals: the representation of the citing paper's content, v_j, and the representation of the specific citation context snippet, v^s_ji, in which the citation is made. The final profiled representation, v̂_i, for document D_i is a static fusion of these signals, as shown below:

   v̂_i = v_i + (1 / |N_in(D_i)|) Σ_{D_j ∈ N_in(D_i)} (α · v^s_ji + β · v_j)   (3)

Here, α ∈ [0, 1] and β ∈ [0, 1] are non-learnable hyperparameters, where α + β = 1. This inherently robust design formulation provides a crucial regularising effect.

[Figure 1: Navigating the performance landscape of public profile on the ACL-200 validation set. Optimal parameters: context parameter = 0.800, meta parameter = 0.300, giving Recall@10 = 0.564.]
For a very recent paper with no citations (|N_in(D_i)| = 0), the profiled representation naturally defaults to its base semantic vector, v_i, directly tackling the cold-start problem. Concurrently, the averaging mechanism ensures that the profiles of highly-cited papers are not unduly skewed, while effectively modelling papers from emerging fields with sparse citations and interdisciplinary work with diverse citation patterns. Crucially, to eliminate potential biases, we deliberately discard explicit signals of impact such as raw citation counts, venue prestige, or publication timelines, irrespective of their presence.

Query Formulation and Efficient Cosine Similarity Search. For an incoming query q = (S_q, M_q), we formulate a composite query representation, v_q, using a similar curation strategy:

   v_q = γ · ENC_1(S_q) + δ · ENC_1(M_q) = γ · ENC_1(S_q) + δ · ENC_1(T_q ⊕ A_q)   (4)

where ⊕ denotes textual concatenation, and γ ∈ [0, 1] and δ ∈ [0, 1] are non-learnable hyperparameters constrained by γ + δ = 1. With the entire corpus of profiled vectors (v̂_i) pre-computed and indexed, the retrieval stage is reduced to a remarkably efficient similarity search. We employ cosine similarity to score the relevance of each candidate document against the query:

   Score(q, D_i) = cosine(v_q, v̂_i)   (5)

The resulting similarity scores are not only used to rank the initial candidate list C_q but are also passed directly to DAVINCI as a valuable set of confidence scores, s_i.

Hyperparameter Selection. The values of the non-learnable hyperparameters (α, β, γ, δ) are determined empirically via a systematic sweep analysis on the validation sets of two of our smaller datasets. Crucially, as shown in Fig. 1, the optimal set of values identified from this constrained analysis is then applied universally across all larger datasets without further tuning, to ensure generalisation.
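Equations 3 to 5 can be sketched end to end with plain Python lists standing in for encoder outputs; every vector and weight below is an illustrative stand-in, not the authors' implementation:

```python
import math

def combine(w_a, vec_a, w_b, vec_b):
    """Weighted sum of two equal-length vectors."""
    return [w_a * a + w_b * b for a, b in zip(vec_a, vec_b)]

def profile(v_base, citers, alpha=0.8, beta=0.2):
    """Eq. 3: add the mean of (alpha*context + beta*citing-paper) signals.
    With no citers, the profile falls back to the base vector (cold start)."""
    if not citers:
        return list(v_base)
    acc = [0.0] * len(v_base)
    for v_context, v_citing in citers:
        acc = [a + c for a, c in zip(acc, combine(alpha, v_context, beta, v_citing))]
    return [v + a / len(citers) for v, a in zip(v_base, acc)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Cold start: an uncited paper keeps its base representation.
assert profile([1.0, 0.0], []) == [1.0, 0.0]

# Query vector (Eq. 4) with gamma=0.8, delta=0.2, then cosine scoring (Eq. 5).
v_q = combine(0.8, [1.0, 0.0], 0.2, [0.0, 1.0])
candidate = profile([0.9, 0.1], [([1.0, 0.0], [0.8, 0.2])])
print(round(cosine(v_q, candidate), 3))
```

The cold-start fallback is exactly the |N_in(D_i)| = 0 case discussed above: the enrichment term vanishes and only v_i remains.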
Our analysis revealed that a specific ratio, i.e., one that moderately prioritises the local context signal over the global document topic, yields consistently strong performance. This finding underscores that the effectiveness of Profiler lies not in dataset-specific tuning, but in its ability to capture a fundamental and generalisable structural property of scholarly networks. Please refer to the technical appendix (Fig. 5) for a detailed analysis.

3.3 The DAVINCI Reranking Architecture

The efficacy of a second-stage reranker is fundamentally constrained by its ability to enrich the semantic information of the query and the candidates obtained from first-stage retrieval. We posit that state-of-the-art performance hinges not merely on the power of semantic encoding, but also on the confidence priors. Moreover, it depends on the sophistication of the fusion mechanism that reconciles these modalities. To this end, we introduce DAVINCI (Discriminative & Adaptive Vector-gated Integration of Network Confidence & Information).
It is founded on two core concepts: (i) a principled, non-linear transformation to refine the low-information signal from the retrieval stage, and (ii) a novel soft-masking mechanism that achieves a dynamic and fine-grained fusion of signals. Finally, the reranker is optimised end-to-end using contrastive learning.

[Figure 2: The architecture of our two-stage citation recommendation system. (1) The non-learnable Profiler performs a scalable retrieval by matching the query against a corpus of documents enriched with their public profile. (2) DAVINCI reranks the retrieved candidates using a vector-gated mechanism to integrate the discriminative retrieval priors with deep semantic features, producing a final ranked list of recommended papers for citation.]

From Degenerate Scores to Discriminative Priors. A prerequisite for effective fusion is the availability of well-informed input signals. Raw cardinal scores from dense retrievers often exhibit severe score compression, providing a low-information signal with poor discriminative capacity.
We therefore introduce a deterministic pre-processing block to transform this signal into a robust retrieval prior. (i) Ordinal Abstraction: We obtain a 1-indexed rank list r_i from the list of cardinal scores s_i. For any ground-truth candidate not found in the Profiler's output (e.g., an oracle-provided positive injected for training), we assign a default rank of k + 1, where k = |C_q|. (ii) Non-Linear Remapping: The resulting integer ranks, while robust, are both linearly spaced and numerically large, and thus fail to capture the power-law distribution of relevance in ranked lists. These large integer values can be problematic for gradient-based optimisation, potentially leading to unstable training or exploding gradients. To address both issues simultaneously, we apply a non-linear exponential decay function to map the rank r_i to the final transformed prior p_i:

   p_i = λ^{r_i}   (6)

where λ ∈ (0, 1) is a decay-rate hyperparameter, empirically set to 0.95. This transformation yields a geometrically spaced, continuous prior that more accurately models the steep non-linear decay of relevance probability. This transformed prior p_i serves as the definitive retrieval signal for all subsequent model components.

Adaptive Gated Fusion. The DAVINCI design is engineered to leverage the discriminative confidence prior p_i and fuse it intelligently with the raw semantic information. The semantic information is obtained by textually concatenating the query text with the candidate text using a [SEP] token and encoding it with a small pre-trained language model, ENC_2(·). We use SciBERT (Beltagy et al., 2019) as the encoder due to its better performance observed with non-graph fusion techniques (Goyal et al., 2024). We extract the [CLS] token's final hidden state, e_cls ∈ R^{d_ENC2}, as the raw semantic representation.
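A toy numeric sketch of the prior construction (Eq. 6) and of the vector-gated soft mask described next. Hand-fixed vectors replace the learned MLP towers, and the k+1 fallback rank for missing ground-truth candidates is omitted for brevity; none of the numbers come from the paper:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def discriminative_priors(scores, lam=0.95):
    """Ordinal abstraction + non-linear remapping: p_i = lam**rank_i (Eq. 6),
    where rank_i is the 1-indexed position of score i in descending order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    priors = [0.0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        priors[idx] = lam ** rank
    return priors

def gated_fusion(h_text, h_score, gate_logits):
    """Concatenate the projected representations, squash the gate logits to
    (0, 1), and apply the gate as a per-dimension soft mask (Hadamard product)."""
    h_concat = h_text + h_score            # concatenation of projections
    gate = [sigmoid(z) for z in gate_logits]
    return [g * h for g, h in zip(gate, h_concat)]

# Compressed raw scores become well-separated, geometrically spaced priors.
priors = discriminative_priors([0.812, 0.809, 0.811])
print([round(p, 4) for p in priors])

# In the full model the gate logits come from MLP_gate([e_cls ; p_i]); fixed here.
h_fused = gated_fusion([0.5, -1.0], [2.0, 0.0], [10.0, -10.0, 0.0, 10.0])
print([round(h, 3) for h in h_fused])
```

The second print illustrates the soft mask: a large positive logit passes its feature nearly unchanged, a large negative logit suppresses it towards zero, and a zero logit halves it.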
To enable fusion, the heterogeneous inputs are first mapped into a common d_h-dimensional latent space via two independent Multi-Layer Perceptron (MLP) towers representing modality-specific projection networks:

• Text Projection Tower (MLP_text): learns a non-linear mapping f_text : R^{d_ENC2} → R^{d_h}, yielding a task-specific text representation h_text.
• Score Projection Tower (MLP_score): learns a mapping f_score : R → R^{d_h}, vectorising the scalar prior p_i into a dense score representation h_score.

To obtain the final processed semantic information, the projected representations are concatenated as:

   h_concat = [h_text ; h_score] ∈ R^{2d_h}   (7)

A separate Gating Network, MLP_gate, computes a vector-valued gate g. This network is conditioned on the original input signals (e_cls and p_i) to form an unbiased assessment of the raw evidence:

   g = σ(MLP_gate([e_cls ; p_i])) ∈ R^{2d_h}   (8)

Here, σ is the element-wise sigmoid function, which constrains each element of the gating vector g to the range (0, 1). Each element g_j can be interpreted as a learned throughput coefficient for the j-th feature. The final fusion is executed via the Hadamard product (⊙), which applies the gate g as a per-dimension soft mask:

   h_fused = g ⊙ h_concat   (9)

This operation constitutes a form of element-wise feature modulation, providing a degree of representational flexibility unattainable with scalar fusion methods. The adaptively fused vector, h_fused, is passed to a dedicated Output Head, a final MLP (MLP_out), which maps the 2d_h-dimensional representation to a single logit. A final sigmoid activation produces the final reranked DAVINCI score S_qi = f_DAVINCI(q, c_i, s_i), as shown below:

   S_qi = σ(MLP_out(h_fused)) ∈ (0, 1)   (10)

It represents the system's final confidence that candidate document c_i is a relevant citation for q.

Learning Objective: Direct Optimisation of Ranking.
To align the model's training with its downstream evaluation, we use a loss function that directly optimises the relative ordering of candidates. The training process is structured around queries and their associated sets of k retrieved document candidates, which are labelled as positive (c+) or negative (c−) based on ground-truth relevance. To construct robust training instances and expose the model to a diverse set of negative signals, we adopt a negative sampling strategy. For a positive candidate c+ associated with a query q, we compare it with n randomly sampled negative candidates, denoted c−_1, c−_2, ..., c−_n, from the pool of k retrieved candidates for the query. This process yields n distinct training triplets per positive example. For each triplet (q, c+, c−_j), the model computes the respective scores, S+ and S−_j. We then optimise the model using the margin-based triplet loss, applied individually to each pair:

   L(S+, S−_j) = max(0, S−_j − S+ + m)   (11)

where m ∈ (0, 1) is a margin hyperparameter. The total loss for a positive sample c+ is the average of the losses computed over the n sampled negatives: (1/n) Σ_{j=1}^{n} L(S+, S−_j). This objective function directly penalises incorrect rank-ordering across a varied subset of competitors, forcing the model to learn a scoring function that produces a well-separated ranking of candidates (cf. Figure 2).

Table 2: Our retrieval module (Profiler) consistently outperforms the SOTA baselines on all datasets, across metrics and also with respect to computational timing. Pref+Enr refers to the sequential combination of Prefetcher followed by Enricher, which leads to higher Recall@300 and NDCG@300 while keeping the same metric values for K=10 and K=50, as per its enrichment principle. Experiments are run on an NVIDIA A100 DGX.

| Model | Comp. Time | MRR | R@10 | R@50 | R@300 | NDCG@10 | NDCG@50 | NDCG@300 |
|---|---|---|---|---|---|---|---|---|
| **ACL-200** | | | | | | | | |
| Prefetcher | 56.22m | 21.14 | 40.33 | 65.37 | 86.98 | 24.57 | 30.11 | 33.30 |
| Pref+Enr | 64.43m | 21.16 | 40.33 | 65.37 | 88.93 | 24.57 | 30.11 | 33.48 |
| Profiler | 2.52m | 30.17 | 53.79 | 74.63 | 89.58 | 34.88 | 39.57 | 41.78 |
| **FullTextPeerRead** | | | | | | | | |
| Prefetcher | 45.61m | 21.73 | 39.17 | 63.43 | 87.16 | 24.78 | 30.15 | 33.63 |
| Pref+Enr | 49.20m | 21.76 | 39.17 | 63.43 | 88.40 | 24.78 | 30.15 | 33.97 |
| Profiler | 1.12m | 31.62 | 57.23 | 82.05 | 96.27 | 36.62 | 42.23 | 44.35 |
| **RefSeer** | | | | | | | | |
| Prefetcher | 99.17h | 11.88 | 22.72 | 41.88 | 66.76 | 13.56 | 17.77 | 21.39 |
| Pref+Enr | 101.43h | 11.92 | 22.72 | 41.88 | 69.91 | 13.56 | 17.77 | 21.88 |
| Profiler | 3.10h | 16.65 | 32.18 | 52.46 | 72.17 | 19.40 | 23.91 | 26.80 |
| **arXiv** | | | | | | | | |
| Prefetcher | 84.31h | 13.78 | 27.09 | 48.83 | 74.16 | 15.94 | 20.73 | 24.43 |
| Pref+Enr | 85.94h | 13.80 | 27.09 | 48.83 | 76.24 | 15.94 | 20.73 | 24.96 |
| Profiler | 2.72h | 16.61 | 33.41 | 55.95 | 76.61 | 19.56 | 24.57 | 27.61 |
| **ArSyTa** | | | | | | | | |
| Prefetcher | 225.88h | 7.89 | 15.52 | 31.08 | 56.00 | 8.96 | 12.36 | 15.95 |
| Pref+Enr | 236.14h | 7.94 | 15.52 | 31.08 | 66.59 | 8.96 | 12.36 | 17.31 |
| Profiler | 7.26h | 13.01 | 26.36 | 47.46 | 69.35 | 15.17 | 19.84 | 23.04 |

4 Experiments and Results

Experimental Setup. We benchmark all the baselines and datasets outlined in the current state-of-the-art work, SymTax, and conduct all experiments under the realistic inductive setting. We exclude Çelik and Tekir (2025) as it relies on additional task-specific parameters that are not defined for the problem setting considered in this work. To provide a multi-faceted assessment of ranking performance, we employ a suite of standard information retrieval metrics (%), namely Mean Reciprocal Rank (MRR), Recall@K, and NDCG@K.
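The margin-based triplet objective (Eq. 11), averaged over the n sampled negatives, can be sketched as follows; the margin value and scores are illustrative:

```python
def triplet_loss(s_pos, s_negs, margin=0.2):
    """Average margin-based triplet loss over n sampled negatives (Eq. 11):
    each negative is penalised only if it comes within `margin` of the positive."""
    return sum(max(0.0, s_neg - s_pos + margin) for s_neg in s_negs) / len(s_negs)

# Both negatives sit outside the margin: zero loss.
print(triplet_loss(0.9, [0.3, 0.5]))

# The first negative violates the margin (0.7 > 0.6 - 0.2): positive loss.
print(triplet_loss(0.6, [0.7, 0.1]))
```

The hinge shape means well-separated triplets contribute no gradient, so training focuses on the hard negatives that actually violate the desired ordering.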
Table 3: Our end-to-end citation recommendation system (Ours) consistently outperforms all baselines.

| Model | MRR | R@5 | R@10 | R@20 | NDCG@5 | NDCG@10 | NDCG@20 |
|---|---|---|---|---|---|---|---|
| **ACL-200** | | | | | | | |
| BM25 | 10.53 | 15.45 | 20.82 | 26.71 | 10.71 | 12.44 | 13.92 |
| SciNCL | 15.41 | 21.39 | 30.04 | 39.76 | 15.01 | 17.79 | 20.24 |
| HAtten | 45.53 | 58.93 | 68.24 | 75.78 | 47.32 | 50.34 | 52.25 |
| SymTax | 46.98 | 60.20 | 69.47 | 76.83 | 48.73 | 51.75 | 53.62 |
| Ours | 50.31 | 64.10 | 73.08 | 80.20 | 52.30 | 55.22 | 57.03 |
| **FullTextPeerRead** | | | | | | | |
| BM25 | 16.60 | 24.50 | 31.15 | 38.23 | 17.27 | 19.42 | 21.23 |
| SciNCL | 17.80 | 25.31 | 35.43 | 46.48 | 17.53 | 20.77 | 23.57 |
| HAtten | 55.03 | 68.60 | 75.58 | 80.62 | 57.33 | 59.60 | 60.88 |
| SymTax | 56.63 | 69.94 | 76.92 | 82.29 | 58.84 | 61.11 | 62.47 |
| Ours | 59.68 | 74.41 | 82.17 | 87.42 | 62.16 | 64.68 | 66.02 |
| **RefSeer** | | | | | | | |
| BM25 | 10.85 | 15.31 | 19.71 | 24.50 | 11.11 | 12.52 | 13.73 |
| SciNCL | 7.17 | 10.02 | 14.68 | 20.46 | 6.74 | 8.23 | 9.69 |
| HAtten | 30.64 | 39.41 | 45.78 | 51.41 | 32.01 | 33.72 | 34.98 |
| SymTax | 31.80 | 40.61 | 47.24 | 53.25 | 32.79 | 34.94 | 36.46 |
| Ours | 32.57 | 42.19 | 49.52 | 56.37 | 33.62 | 36.00 | 37.73 |
| **arXiv** | | | | | | | |
| BM25 | 10.28 | 14.64 | 19.04 | 23.89 | 10.50 | 11.93 | 13.15 |
| SciNCL | 9.22 | 13.06 | 18.37 | 24.89 | 8.91 | 10.61 | 12.25 |
| HAtten | 28.13 | 37.01 | 45.06 | 52.32 | 28.86 | 31.36 | 32.37 |
| SymTax | 29.02 | 38.46 | 46.78 | 54.97 | 29.80 | 32.49 | 34.56 |
| Ours | 30.46 | 40.86 | 49.89 | 58.50 | 31.38 | 34.31 | 36.49 |
| **ArSyTa** | | | | | | | |
| BM25 | 9.24 | 13.39 | 17.52 | 22.14 | 9.46 | 10.79 | 11.96 |
| SciNCL | 8.16 | 11.25 | 15.71 | 21.08 | 7.85 | 9.28 | 10.64 |
| HAtten | 19.92 | 27.70 | 34.90 | 42.25 | 20.50 | 22.83 | 24.69 |
| SymTax | 22.00 | 30.16 | 38.06 | 46.03 | 22.49 | 25.05 | 27.07 |
| Ours | 24.01 | 33.74 | 42.83 | 51.56 | 24.73 | 27.67 | 29.89 |

Results: First-Stage Retrieval. We compare the results of our Profiler with the current state-of-the-art Prefetcher (Gu et al., 2022) and the sequential combination of Prefetcher followed by Enricher (Goyal et al., 2024). The Prefetcher operates on a hierarchical attention-based text encoding to obtain a retrieved candidate list. The Enricher ingests the top 100 candidates from this prefetched list and models their symbiotic relationships embedded in the citation network to curate an enriched list of retrieved candidates, thus yielding a significantly higher Recall@300.
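The metrics reported throughout (MRR, Recall@K, NDCG@K) follow standard information-retrieval definitions. A minimal binary-relevance sketch, not the authors' evaluation code; the ranked lists below are illustrative:

```python
import math

def mrr(first_relevant_ranks):
    """Mean Reciprocal Rank; each entry is the 1-indexed rank of the first
    relevant document for a query, or None if none was retrieved."""
    return sum(0.0 if r is None else 1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents present in the top-k list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """NDCG@k with binary relevance: DCG of the list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

print(round(mrr([1, 2, None]), 3))
print(recall_at_k(["a", "b", "c"], ["b", "d"], k=2))
print(round(ndcg_at_k(["a", "b", "c"], ["b"], 3), 3))
```

Unlike Recall@K, NDCG@K is position-sensitive: a relevant document at rank 1 scores higher than the same document at rank k.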
In Table 2, results show that the non-learnable and scalable nature of Profiler makes it highly computationally efficient, reducing the retrieval time by 32.52x and 43.92x on the largest dataset (ArSyTa) and the smallest dataset (FullTextPeerRead), respectively. Results also show Profiler's merit in retrieving better candidates, increasing MRR by 63.85% and 45.3% on ArSyTa and FullTextPeerRead, respectively.

Results: End-to-End System. We evaluate our complete system against the other standard baselines in Table 3, as detailed in our experimental setup. We outperform the SOTA citation recommendation systems and establish a new state-of-the-art on all datasets across all metrics.

[Figure 3: The indispensable role of the public profile. Disabling profile enrichment causes a severe and consistent collapse in retrieval performance across all datasets. Maximum Recall@10 per dataset: ArSyTa 0.262 (meta parameter 0.50), arXiv 0.289 (0.50), RefSeer 0.290 (0.20), ACL-200 0.386 (0.40), FullTextPeerRead 0.436 (0.30).]

5 Analysis

To dissect the contributions of our core design choices, we conduct a series of targeted ablation studies on both the Profiler and the DAVINCI reranker. These analyses are designed to validate our architectural hypotheses and quantify the impact of each novel component. Additionally, we present both the quantitative and the qualitative analysis in the technical appendix (A.1) owing to the page limit.

Profiler. We perform two key analyses to validate the efficacy of the public-profile concept and its implementation in the Profiler. In Figure 1, we visualise and navigate the landscape of the public profile on the ACL-200 dataset for Recall@10, clearly depicting the entire spectrum of the public profile. We show further analyses in the technical appendix (Fig. 5).
To measure the performance gain enabled by profiling, we conduct an ablation where the profile enrichment is turned off (i.e., setting α = 0 and β = 0 in Equation 3, so v̂_i = v_i). As shown in Fig. 3, we observe a sharp degradation in retrieval performance for all datasets across varied query compositions (i.e., different γ, δ values). Moreover, we observe that large and tough datasets are relatively more robust to varied query compositions in this case. This directly confirms that profiling is not merely a hypothetical construct but a vital signal for effective first-stage retrieval.

Model  MRR    Recall@5  Recall@10  Recall@20  NDCG@5  NDCG@10  NDCG@20

ACL-200
Ours   50.31  64.10     73.08      80.20      52.30   55.22    57.03
A1     48.42  62.42     71.32      78.38      50.44   53.34    55.13
A2     48.30  61.85     70.75      77.81      50.20   53.10    54.89
A3     49.46  62.67     71.83      78.49      51.27   54.26    55.96
A4     45.16  57.66     66.05      72.99      46.81   49.54    51.30

FullTextPeerRead
Ours   59.68  74.41     82.17      87.42      62.16   64.68    66.02
A1     58.08  72.88     80.30      86.19      60.56   62.96    64.46
A2     58.20  72.78     80.20      86.26      60.61   63.03    64.56
A3     58.49  72.90     80.88      86.23      60.83   63.42    64.78
A4     53.58  68.04     75.90      82.22      55.90   58.46    60.08

Table 4: Ablation analysis showing the impact of our design choices w.r.t. our complete system, namely, A1 (Semantics Only), A2 (Turned-off Discriminator), A3 (Softmax Normalisation), and A4 (Scalar Gating).

DAVINCI. To isolate the contribution of each component within DAVINCI, we conduct four ablation studies, systematically deconstructing the full model. The results for these ablations on the FullTextPeerRead and ACL-200 datasets are presented in Table 4 and are described as follows: (1) Semantics Only: We discard the use of network confidence scores. This experiment is designed to quantify the value of integrating the Profiler's retrieval confidence into the reranking stage.
(2) Turned-off Discriminator: We bypass our signal-refining process (ordinal abstraction and exponential remapping) and instead feed the raw, untransformed confidence scores from the Profiler, to test the necessity of our proposed transformation for handling low-information retrieval signals. (3) Softmax Normalisation: We replace our discriminative transformation with a standard softmax function applied to the retrieval scores of the top-k candidates. This provides a direct comparison of our principled remapping scheme against a common baseline for score normalisation. (4) Scalar Gating: We replace the vector-gating mechanism with scalar gating of the semantic information controlled by the discriminative prior. This experiment directly measures the performance gain attributable to our fine-grained, per-dimension adaptive fusion policy.

6 Comparison with Massive-Scale Rerankers

We conduct an experiment to answer a critical question: can a compact, purpose-built reranker like DAVINCI outperform general-purpose reranking models with orders of magnitude more parameters?
We evaluate against current state-of-the-art reranking models, including the latest Qwen3-Reranker-8B (Zhang et al., 2025) and bge-reranker-v2-minicpm-40 with 2.72B parameters (Chen et al., 2024; Li et al., 2023). In contrast, our DAVINCI model is exceptionally lightweight, comprising only 110M parameters.

Model          MRR    Recall@5  Recall@10  Recall@20  NDCG@5  NDCG@10  NDCG@20

ACL-200
DAVINCI        50.31  64.10     73.08      80.20      52.30   55.22    57.03
Qwen3-R-8B     36.44  50.96     63.06      72.83      38.02   41.94    44.42
bge-R-v2-m-40  33.52  45.27     55.23      64.70      34.55   37.78    40.17

FullTextPeerRead
DAVINCI        59.68  74.41     82.17      87.42      62.16   64.68    66.02
Qwen3-R-8B     48.15  66.84     77.62      85.62      51.08   54.60    56.63
bge-R-v2-m-40  41.22  53.71     63.87      73.16      42.44   45.75    48.11

Refseer
DAVINCI        32.57  42.19     49.52      56.37      33.62   36.00    37.73
Qwen3-R-8B     24.81  35.39     44.98      54.04      25.67   28.79    31.09
bge-R-v2-m-40  22.10  30.27     38.39      46.58      22.53   25.15    27.22

arXiv
DAVINCI        30.46  40.86     49.89      58.50      31.38   34.31    36.49
Qwen3-R-8B     25.48  36.02     47.19      57.35      26.10   29.72    32.30
bge-R-v2-m-40  21.70  29.79     38.10      46.87      21.99   24.68    26.89

ArSyTa
DAVINCI        24.01  33.74     42.83      51.56      24.73   27.67    29.89
Qwen3-R-8B     22.39  32.44     40.71      49.33      23.26   25.95    28.13
bge-R-v2-m-40  17.79  24.70     31.73      38.79      18.03   20.31    22.08

Table 5: Performance of DAVINCI (110M) vs. massive-scale rerankers. 'R': Reranker; 'm': minicpm.

To ensure a fair comparison, we standardise the retrieval stage for all models: each reranker is provided with the exact same list of candidate documents retrieved by our Profiler module, and we evaluate performance on the same test sets used for DAVINCI. Due to the immense size of these rerankers and their general-purpose pre-training, we employ instruction-aware prompting to adapt them to our specific task and datasets, as detailed in the appendix (A.3). Despite being up to 70x smaller than the latest SOTA reranker, our specialised model markedly outperforms these general-purpose models on all datasets, demonstrating the merit of task-specific design over raw parameter scale in an era of massive models.
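One such task-specific design choice is DAVINCI's vector gating (ablation A4, Table 4). The sketch below contrasts it with the scalar-gate ablation under assumed forms; the sigmoid, the gate parameters W and b, and the exact fusion equation are our illustrative assumptions, not the paper's specification.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scalar_gate(semantic, prior):
    """Ablation A4: a single scalar derived from the discriminative prior
    scales every dimension of the semantic vector identically."""
    g = sigmoid(prior)
    return [g * s for s in semantic]

def vector_gate(semantic, prior, W, b):
    """Per-dimension gating: each output dimension gets its own gate,
    computed from the concatenated [semantic; prior] input."""
    x = semantic + [prior]
    gates = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
             for row, bi in zip(W, b)]
    return [g * s for g, s in zip(gates, semantic)]
```

With zero weights both gates open halfway; the difference is that vector gating can learn a different opening per dimension, which is the fine-grained fusion that A4 removes.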
7 Conclusion

This work presents a principled re-evaluation of the citation recommendation task, advancing the field on two fundamental fronts: the veracity of its benchmarks and the efficiency of its architectures. By instituting a rigorous inductive protocol, we first establish a more faithful measure of real-world performance. Next, our proposed two-stage system, pairing a non-learnable retriever with a specialised gated reranker, sets a new benchmark for both retrieval and end-to-end recommendation. The strong performance of our compact, 110M-parameter model against multi-billion-parameter rerankers underscores a key finding: for specialised domains, architectural sophistication, task-aligned design choices and the integration of domain-specific knowledge are more salient drivers of success than raw parameter count.

8 Limitations

This document presents purely a work of research and is not about productising it into a digital assistant. While our proposed framework achieves state-of-the-art performance and addresses several systemic bottlenecks in citation recommendation, it is subject to several limitations. First, our evaluation is primarily constrained to the English language and specific scientific domains, namely Computer Science. While the underlying mechanisms of the Profiler and the DAVINCI reranker are theoretically domain-agnostic, the stylistic nuances of "citation contexts" in the humanities or social sciences may differ. Second, although we introduce an inductive setting to better simulate real-world conditions, our system still faces a partial cold-start challenge for "absolute" new papers. Since the Profiler leverages the collective perception of the research ecosystem, its utility may be diminished to an extent for extremely recent publications that have not yet been integrated into the citation network, leaving the recommendation to rely solely on textual semantic alignment.
Furthermore, like most Transformer-based architectures, our reranker is limited by a maximum input sequence length. In instances where a citation requires a global understanding of a very long document or a complex multi-paragraph narrative, the 512-token window may truncate essential context. Lastly, the system's performance remains contingent on the quality of available metadata; missing abstracts or poorly parsed titles in the source corpus could lead to suboptimal candidates during the retrieval phase and thus in the final recommendation.

9 Ethical Considerations

The development of automated citation recommendation systems carries significant implications for the scientific community. A primary concern is the potential for popularity bias, wherein already highly cited papers are disproportionately recommended, further marginalising niche or emerging research. While we have designed the Profiler to be more objective than the previous Enricher module, any system trained on historical citation data inherently risks perpetuating existing human biases. We emphasise that LCR systems should not replace a researcher's responsibility to conduct a thorough and critical literature review. Over-reliance on such systems could lead to lazy citing, where authors cite suggested papers without fully engaging with the source material. Furthermore, we recognise the theoretical risk of citation manipulation, where recommendation algorithms could be gamed to artificially boost the visibility of specific authors or institutions. To mitigate this, we advocate for transparency and will make our code and trained models publicly available for community audit. Finally, we address the environmental impact of our work by prioritising computational efficiency.
By designing a lightweight, non-learnable retrieval module and a more efficient reranker than massive-scale open-source models, we significantly reduce the carbon footprint and hardware requirements associated with training and deploying large-scale citation systems.

References

Zafar Ali, Guilin Qi, Khan Muhammad, Pavlos Kefalas, and Shah Khusro. 2021. Global citation recommendation employing generative adversarial network. Expert Systems with Applications, 180:114888.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. 2018. Content-based citation recommendation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 238–251, New Orleans, Louisiana. Association for Computational Linguistics.

Ege Yiğit Çelik and Selma Tekir. 2025. CiteBART: Learning to generate citations for local citation recommendation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1703–1719, Suzhou, China. Association for Computational Linguistics.

Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, Bangkok, Thailand. Association for Computational Linguistics.

Tao Dai, Li Zhu, Yaxiong Wang, and Kathleen M. Carley. 2019.
Attentive stacked denoising autoencoder with Bi-LSTM for personalized context-aware citation recommendation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:553–568.

Priyangshu Datta, Suchana Datta, and Dwaipayan Roy. 2024. Raging against the literature: LLM-powered dataset mention extraction. In Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries, pages 1–12.

John A. Drozdz and Michael R. Ladomery. 2024. The peer review process: past, present, and future. British Journal of Biomedical Science, 81:12054.

Travis Ebesu and Yi Fang. 2017. Neural citation network for context-aware citation recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1093–1096.

Karan Goyal, Mayank Goel, Vikram Goyal, and Mukesh Mohania. 2024. SymTax: Symbiotic relationship and taxonomy fusion for effective citation recommendation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8997–9008, Bangkok, Thailand. Association for Computational Linguistics.

Nianlong Gu, Yingqiang Gao, and Richard H. R. Hahnloser. 2022. Local citation recommendation with hierarchical-attention text encoder and SciBERT-based reranking. In European Conference on Information Retrieval, pages 274–288. Springer.

Qi He, Jian Pei, Daniel Kifer, Prasenjit Mitra, and Lee Giles. 2010. Context-aware citation recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 421–430.

Wenyi Huang, Zhaohui Wu, Chen Liang, Prasenjit Mitra, and C. Giles. 2015. A neural probabilistic model for context based citation recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.

Chanwoo Jeong, Sion Jang, Eunjeong Park, and Sungchul Choi. 2020. A context-aware citation recommendation model with BERT and graph convolutional networks. Scientometrics, 124:1907–1922.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, page 2.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.

Chaofan Li, Zheng Liu, Shitao Xiao, and Yingxia Shao. 2023. Making large language models a better foundation for dense retrieval. Preprint, arXiv:2312.15503.

Avishay Livne, Vivek Gokuladas, Jaime Teevan, Susan T. Dumais, and Eytan Adar. 2014. CiteSight: supporting contextual citation recommendation using differential search. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 807–816.

Zoran Medić and Jan Šnajder. 2020. Improved local citation recommendation based on context enhanced with global information. In Proceedings of the First Workshop on Scholarly Document Processing, pages 97–103.

Ping Ni, Xianquan Wang, Bing Lv, and Likang Wu. 2024. GTR: An explainable graph topic-aware recommender for scholarly document. Electronic Commerce Research and Applications, 67:101439.

Malte Ostendorff, Nils Rethmeier, Isabelle Augenstein, Bela Gipp, and Georg Rehm. 2022. Neighborhood contrastive learning for scientific document representations with citation embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11670–11688, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Ronald Rousseau, Carlos Garcia-Zorita, and Elias Sanz-Casado. 2023. Publications during COVID-19 times: An unexpected overall increase. Journal of Informetrics, 17(4):101461.

Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, and Sergey Feldman. 2022. SciRepEval: A multi-format benchmark for scientific document representations.
In Conference on Empirical Methods in Natural Language Processing.

Qianqian Xie, Yutao Zhu, Jimin Huang, Pan Du, and Jian-Yun Nie. 2021. Graph neural collaborative topic model for citation recommendation. ACM Transactions on Information Systems (TOIS), 40(3):1–30.

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.

A Appendix

A.1 Analysis

Quantitative Analysis. To analyse the sensitivity of our reranker to the candidate pool size, we present a quantitative analysis varying the number of candidates (k) to be reranked. While the main experiments in this paper are conducted with k=300, Table 6 details the performance variation at different values of k.

k     MRR    Recall@5  Recall@10  Recall@20  NDCG@5  NDCG@10  NDCG@20

ACL-200
50    46.97  59.62     67.02      71.75      49.04   51.46    52.67
100   48.92  62.08     70.36      76.46      50.92   53.62    55.17
300   50.31  64.10     73.08      80.20      52.30   55.22    57.03
1000  50.92  64.86     74.31      81.73      52.84   55.91    57.79

FullTextPeerRead
50    55.03  68.60     75.14      79.35      57.48   59.61    60.68
100   57.69  72.01     79.25      83.87      60.19   62.54    63.72
300   59.68  74.41     82.17      87.42      62.16   64.68    66.02
1000  60.20  75.14     82.84      88.43      62.71   65.22    66.64

Refseer
50    28.78  37.47     43.61      48.73      29.96   31.95    33.25
100   30.60  39.77     46.55      52.66      31.70   33.90    35.45
300   32.57  42.19     49.52      56.37      33.62   36.00    37.73
1000  32.74  42.01     49.11      55.89      33.68   35.98    37.70

arXiv
50    27.24  37.12     45.01      51.55      28.44   31.00    32.66
100   28.85  39.02     47.65      55.33      29.89   32.69    34.64
300   30.46  40.86     49.89      58.50      31.38   34.31    36.49
1000  30.06  39.82     48.62      56.97      30.81   33.66    35.78

ArSyTa
50    21.51  30.52     37.73      43.67      22.60   24.94    26.45
100   22.96  32.47     40.60      47.84      23.93   26.57    28.40
300   24.01  33.74     42.83      51.56      24.73   27.67    29.89
1000  20.74  29.24     38.37      47.98      21.02   23.96    26.39

Table 6: Analysis showing the impact of the number of candidates (k) on reranking performance.
We find k = 300 to be an overall better choice for final reranking performance with respect to the metrics and the computational overhead.

The results reveal two distinct trends: on smaller datasets, performance scales positively with k; however, on larger datasets, performance peaks around k=300 and subsequently degrades. This degradation suggests that processing too many low-quality candidates introduces noise that can harm the reranker's precision. Given that computational cost also grows linearly with k, this analysis confirms that k=300 represents an optimal trade-off, maximising performance while avoiding the dual penalties of increased noise and computational overhead.

Qualitative Analysis. To complement our quantitative results and provide deeper insight into the mechanisms driving our model's performance, we conduct a qualitative case study. By manually inspecting the recommendations for a representative query, we can better understand how our system compares with state-of-the-art citation recommendation systems, as shown in Table 7. We select a query paper from our test set whose topic is nuanced and requires a deep understanding of the semantics. The SOTA models demonstrate a classic failure mode of relying on broad and superficial topic matching. They correctly identify the general topic of 'Machine Translation' but completely miss the critical and specific usage of the term 'MERT'. Instead, they focus on another term, 'Moses', from both the citation context and the query abstract, and use these two signals to recommend from the candidate pool. Our system, on the other hand, identifies the same general topic of 'Machine Translation' but intelligently picks up the abbreviated term 'MERT' and uses it effectively for recommending from the retrieved candidate set by comparing it with candidate titles.
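The peak-then-degrade trend and the linear-cost argument can be made concrete with the ArSyTa MRR column of Table 6; the selection helper below is our illustration, not part of the paper's pipeline:

```python
# MRR vs. candidate pool size k on ArSyTa (values from Table 6).
mrr_by_k = {50: 21.51, 100: 22.96, 300: 24.01, 1000: 20.74}

def best_k(scores):
    """Pick the pool size with the highest metric; since reranking cost
    grows linearly with k, ties favour the smaller pool."""
    return max(scores, key=lambda k: (scores[k], -k))
```

Here best_k(mrr_by_k) returns 300: MRR keeps rising up to k=300 and then drops at k=1000 as noisy candidates enter the pool.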
A.2 Implementation Details

Our experimental pipeline is designed to reflect the distinct computational profiles of retrieval and reranking. The coarse-grained retrieval stages for all systems are executed on NVIDIA A100 DGX clusters. The more computationally intensive, fine-grained reranking stages utilise NVIDIA H200 DGX systems to ensure efficient processing. Given the substantial scale of the corpora and datasets, conducting multiple full training runs is computationally prohibitive. To ensure the robustness of our findings, we first perform a stability analysis: we conduct three training trials on representative subsets of the training data and observe minimal variance in performance, confirming the numerical stability of our training procedure. Consequently, the final results reported in all tables are from a single, comprehensive run on the full-scale datasets. To support open science and ensure full reproducibility, we are committed to a comprehensive release upon acceptance. This will include the complete source code, detailed hyperparameter configurations for all experiments, and the pre-trained model checkpoints for each dataset. This will facilitate further research and allow the community to readily apply our models to similar reranking tasks.

To provide a multi-faceted assessment of ranking performance, we employ a suite of standard information retrieval metrics. We measure Recall@K for different values of K to evaluate the fraction of queries for which the correct citation is found within the top-K recommendations. To assess the quality of the ranking order, we use Mean Reciprocal Rank (MRR), which rewards systems for placing the correct item higher in the list by returning the average of the reciprocal ranks of the correct

Citation Context: "lation, phrases are extracted from this synthetic corpus and added as a separate phrase table to the combined system (CH1).
The relative importance of this phrase table is estimated in standard MERT (TARGETCIT). The final translation of the test set is produced by Moses (enriched with this additional phrase table) and additionally post-processed by Depfix. Note that all components of this combination have d"

Query Title: What a Transfer-Based System Brings to the Combination with PBMT.

Query Abstract: We present a thorough analysis of a combination of a statistical and a transfer-based system for English→Czech translation, Moses and TectoMT. We describe several techniques for inspecting such a system combination which are based both on automatic and manual evaluation. While TectoMT often produces bad translations, Moses is still able to select the good parts of them. In many cases, TectoMT provides useful novel translations which are otherwise simply unavailable to the statistical component, despite the very large training data. Our analyses confirm the expected behaviour that TectoMT helps with preserving grammatical agreements and valency requirements, but that it also improves a very diverse set of other phenomena. Interestingly, including the outputs of the transfer-based system in the phrase-based search seems to have a positive effect on the search space. Overall, we find that the components of this combination are complementary and the final system produces significantly better translations than either component by itself.
Top-10 recommendations per system (ground-truth citation marked with *):

HAtten: (1) Moses: Open Source Toolkit for Statistical Machine Translation; (2) Combining Multi-Engine Translations with Moses; (3) SMT and SPE Machine Translation Systems for WMT'09; (4) MANY: Open Source MT System Combination at WMT'10; (5) Edinburgh's Machine Translation Systems for European Language Pairs; (6) Toward Using Morphology in French-English Phrase-based SMT; (7) Parallel Implementations of Word Alignment Tool; (8) Improved Alignment Models for Statistical Machine Translation; (9) Investigations on Translation Model Adaptation Using Monolingual Data; (10) A Statistical Approach to Machine Translation.

SymTax: (1) Moses: Open Source Toolkit for Statistical Machine Translation; (2) Findings of the 2012 Workshop on Statistical Machine Translation; (3) A Statistical Approach to Machine Translation; (4) Combining Multi-Engine Translations with Moses; (5) Phrasetable Smoothing for Statistical Machine Translation; (6) Minimum Error Rate Training in Statistical Machine Translation*; (7) Training Phrase Translation Models with Leaving-One-Out; (8) SMT and SPE Machine Translation Systems for WMT'09; (9) Statistical Phrase-Based Translation; (10) MANY: Open Source MT System Combination at WMT'10.

Ours: (1) Minimum Error Rate Training in Statistical Machine Translation*; (2) Statistical Phrase-Based Translation; (3) Moses: Open Source Toolkit for Statistical Machine Translation; (4) Findings of the 2012 Workshop on Statistical Machine Translation; (5) Improved Statistical Alignment Models; (6) A Statistical Approach to Machine Translation; (7) Improvements in Phrase-Based Statistical Machine Translation; (8) Combining Multi-Engine Translations with Moses; (9) Hierarchical Phrase-Based Translation; (10) Phrasetable Smoothing for Statistical Machine Translation.

Table 7: Case study of citation recommendations for a sample from the ACL-200 dataset. The table contrasts the top-10 predictions from SOTA baseline models against our system, with the ground-truth citation marked (*) to illustrate our model's improved relevance.
We can see that our model successfully predicts the correct citation by checking the abbreviation 'MERT' against the titles of the available candidates, whereas the other systems focus only on the term 'Moses' in the abstract of the citing paper and the citation context, and use it for matching. Ranks denote the order of the recommended citations.

recommendations, and Normalised Discounted Cumulative Gain (NDCG@K), which similarly provides a greater reward for correct items ranked at the very top by applying a logarithmic discount to the relevance of items based on their position. For all metrics, we report the average over all test queries as a percentage, where higher values indicate better performance, consistent with the established literature.

A.3 Massive-Scale Open-Source Rerankers

For a rigorous comparison against the state-of-the-art, we select two notable massive-scale, general-purpose foundation models for reranking. These models represent the current paradigm of training extremely large transformers on diverse, web-scale data to create powerful, zero-shot text-understanding capabilities. Their inclusion establishes clear and challenging baselines, allowing us to evaluate the performance of our specialised, task-specific model against these large generalist systems.

Qwen Reranker Series. The Qwen model series, developed by Alibaba Cloud, represents a significant advancement in open-source language models. The rerankers from this series are specifically fine-tuned for relevance-ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model.
The Qwen rerankers, based on a powerful transformer architecture, are trained on massive datasets of query-document pairs, learning to discern subtle relevance signals far beyond simple keyword matching. As instruction-tuned models, they operate as cross-encoders that expect a structured prompt. The model ingests the query and document by embedding them within a specific template that defines the task. This allows for deep, token-level interaction between the query and the document, conditioned on the explicit instruction. The model is trained to output a single logit, where a higher value indicates a higher probability of relevance. We select the Qwen reranker as it is widely regarded as a state-of-the-art, general-purpose reranker. Its strong performance across various public benchmarks makes it a formidable baseline to measure against. We use the latest and largest available open-source version, Qwen3-Reranker-8B with 8.19B parameters, from the Qwen series for our experiments.

BGE Reranker v2 (BAAI General Embedding). The BGE model family, released by the Beijing Academy of Artificial Intelligence (BAAI), is another highly influential series of models optimised for text retrieval and ranking. BGE-Reranker-v2 is particularly notable for its excellent performance and efficiency. The BGE reranker is also a cross-encoder based on a transformer architecture. It has been fine-tuned on a mixture of public and proprietary datasets specifically for relevance ranking. The model architecture, often based on efficient backbones like minicpm, is designed to deliver high performance without the prohibitive computational cost of the largest models. The layerwise aspect in some variants refers to advanced techniques that leverage representations from multiple transformer layers, which can enhance performance. The usage is identical to that of Qwen: it ingests a (query, document) pair and processes it through its transformer layers.
It outputs a relevance logit, which is used to re-sort the candidates. BGE models are known for their strong performance on standardised retrieval benchmarks like MTEB (the Massive Text Embedding Benchmark). We employ bge-reranker-v2-minicpm-layerwise, which has 40 layers and 2.72B parameters, to provide another strong, publicly available baseline from a different lineage than Qwen. Its high ranking on public leaderboards and widespread adoption in the community make it an essential point of comparison for any new reranking model.

Figure 4: Python functions for constructing the query and candidate document strings from the available raw data. The create_query_from_citation function combines the citation context with metadata from the citing paper, while create_document_from_paper formats the candidate paper's information.

Implementation and Usage. To ensure a fair and direct comparison, we follow a consistent protocol for all baseline models. The pre-trained checkpoints for both the Qwen and BGE rerankers are loaded directly from the Hugging Face Hub. For each query-document pair, we use the specific instruction-based prompt formats recommended for Qwen and bge, respectively. For Qwen, an instruction, the query, and the candidate document text are combined into a single string template: "<Instruct>: instruction <Query>: query <Document>: doc". For the instruction placeholder, we curate a clear task description as suggested in the Qwen guidelines: "Given a citation context and citing paper information, determine if the candidate paper is relevant to be cited in this context". The sequences are truncated to the models' maximum input length. For bge, we follow its guidelines by choosing its recommended bge-specific prompt: "Given a query A and a passage B, determine whether the passage contains an answer to the query by providing a prediction of either 'Yes' or 'No'". We describe the query and candidate document construction using the functions shown in Figure 4.
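The Qwen template above can be assembled as follows; the helper name is ours, and truncation to the model's maximum input length (handled by the tokenizer in practice) is omitted:

```python
QWEN_INSTRUCTION = (
    "Given a citation context and citing paper information, determine "
    "if the candidate paper is relevant to be cited in this context"
)

def format_qwen_input(query, doc, instruction=QWEN_INSTRUCTION):
    """Fill the single-string template used for the Qwen reranker:
    <Instruct>: ... <Query>: ... <Document>: ..."""
    return f"<Instruct>: {instruction} <Query>: {query} <Document>: {doc}"
```

Each formatted string is then scored by the reranker, whose raw output logit serves as the relevance score.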
We run the models in inference mode on the same evaluation sets. For each formatted input, we extract the raw logit output before any final activation. This logit is used directly as the relevance score for reranking. To reiterate, the same set of retrieved candidates and the same evaluation metrics used for our own system are applied to these baselines to maintain experimental consistency.

[Figure 5: Heatmaps over the context parameter and the meta parameter. Optima, reported as (context, meta): (a) MRR — FullTextPeerRead 0.3206 at (0.800, 0.200), ACL-200 0.3160 at (0.800, 0.300); (b) Recall@10 — FullTextPeerRead 0.5741 at (0.800, 0.300), ACL-200 0.5640 at (0.800, 0.300); (c) NDCG@10 — FullTextPeerRead 0.3700 at (0.800, 0.200), ACL-200 0.3656 at (0.800, 0.300).]

Figure 5: Navigating the performance landscape of the public-profile enrichment on the FullTextPeerRead and ACL-200 validation sets. Each plot shows a different evaluation metric: (a) MRR, (b) Recall@10, and (c) NDCG@10.