
Paper deep dive

CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad, Hassan AlOmaish, Sung Eun Kim, Oishi Banerjee, Pranav Rajpurkar

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 37

Abstract

We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's τ = 0.61–0.71; Pearson's r = 0.71–0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass–fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1–5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.

Tags

ai-safety (imported, 100%) · cscl (suggested, 92%) · preprint (suggested, 88%) · safety-evaluation (suggested, 80%)

Links

PDF not stored locally; view it on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/13/2026, 12:17:20 AM

Summary

CRIMSON is a clinically-grounded, LLM-based evaluation framework for chest X-ray report generation. It improves upon existing metrics by incorporating patient context (age, indication), using a severity-aware error taxonomy, and weighting findings based on clinical significance. Validated against radiologist annotations in ReXVal and new benchmarks RadJudge and RadPref, it demonstrates superior alignment with expert clinical judgment.

Entities (5)

CRIMSON · evaluation-framework · 100%
MedGemma · model · 95%
RadJudge · benchmark · 95%
RadPref · benchmark · 95%
ReXVal · benchmark · 95%

Relation Signals (3)

CRIMSON evaluates Chest X-ray report

confidence 98% · CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation

CRIMSON validated by ReXVal

confidence 95% · CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal

MedGemma generates CRIMSON predictions

confidence 90% · fine-tune MedGemma to generate CRIMSON predictions

Cypher Suggestions (2)

Find all benchmarks associated with the CRIMSON framework · confidence 90% · unvalidated

MATCH (e:Entity {name: 'CRIMSON'})-[:VALIDATED_BY]->(b:Benchmark) RETURN b.name

Identify models used to generate CRIMSON metrics · confidence 85% · unvalidated

MATCH (m:Model)-[:GENERATES]->(o:MetricOutput {name: 'CRIMSON predictions'}) RETURN m.name

Full Text

36,805 characters extracted from source content.


CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Mohammed Baharoon* (Department of Biomedical Informatics, Harvard Medical School, Boston, MA), Thibault Heintz* (Department of Radiation Oncology, Mass General Brigham, Boston, MA), Siavash Raissi* (Department of Biomedical Informatics, Harvard Medical School, Boston, MA), Mahmoud Alabbad (King Fahad Hospital, Al-Ahsa Health Cluster, Al Hofuf, Saudi Arabia), Mona Alhammad (Ras-Tanura General Hospital, Ministry of Health, Eastern Province, Saudi Arabia), Hassan AlOmaish (Department of Medical Imaging, King Abdulaziz Medical City, Ministry of National Guard, Riyadh, Saudi Arabia), Sung Eun Kim (Department of Biomedical Informatics, Harvard Medical School, Boston, MA), Oishi Banerjee (Department of Biomedical Informatics, Harvard Medical School, Boston, MA), Pranav Rajpurkar (Department of Biomedical Informatics, Harvard Medical School, Boston, MA)

*These authors contributed equally

Abstract

We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation).
Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall’s τ = 0.61–0.71; Pearson’s r = 0.71–0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass–fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1–5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.

Contact: MohammedSalimAB@outlook.com

1 Introduction

Automated radiology report generation has advanced rapidly with the emergence of large vision-language models, yet reliable evaluation remains a fundamental challenge [23, 19, 7]. Recent radiology-specific metrics have moved beyond surface-level text similarity and instead assess factual correctness through structured error counting and finding-level comparison [11, 25, 5, 24, 17, 3]. These approaches represent important progress toward clinically meaningful evaluation by explicitly detecting hallucinations and omissions. Despite these advances, current metrics largely treat detected errors as either uniformly important or binary (significant vs.
not significant), and evaluate findings in relative isolation from broader clinical context. In practice, the clinical consequences of errors vary substantially. For instance, failing to report a life-threatening pneumothorax is categorically different from missing age-related aortic calcification. Moreover, the relevance and interpretation of findings depend on a patient’s age and indication. Existing evaluation frameworks do not explicitly encode clinical severity as a fine-grained spectrum; instead, they collapse findings into coarse categories, lack sufficient clinical context to determine true significance, or rely on LLM judgments without structured in-context guidelines [11, 5, 17]. Consequently, these frameworks conflate minor, clinically inconsequential discrepancies with omissions that directly impact patient safety.

Figure 1: Representative RadJudge cases illustrating core design principles of CRIMSON. Top: Patient context sensitivity. The clinical impact of an omission (e.g., aortic atherosclerosis) varies by age and indication, and CRIMSON adjusts severity accordingly. Middle: Normal finding handling. CRIMSON does not reward mentioning normal findings, preventing score inflation. Bottom: Clinical significance weighting. Errors are weighted by consequence, prioritizing clinically important findings. In each case, CRIMSON aligns with radiologist expectations, whereas prior metrics fail.

To address these limitations, we introduce CRIMSON, a clinically grounded LLM-based evaluation framework designed to align automated assessment with real-world radiologic reasoning. CRIMSON evaluates reports at the level of individual findings while incorporating full clinical context, including patient age and indication.
The framework models a comprehensive taxonomy of errors—false findings, missing findings, attribute-level errors (e.g., location, severity, measurement) and significance labels (urgent, actionable non-urgent, non-actionable, or expected/benign)—defined according to a rubric developed with cardiothoracic radiologists. The severity labels determine weights within a principled scoring formulation that prioritizes clinically consequential errors over benign discrepancies and supports partial credit for partially correct findings under fine-grained attribute rules. We validate CRIMSON through alignment with radiologist-annotated clinically significant error counts in ReXVal [20, 19] (PhysioNet Credentialed Health Data License 1.5.0). We also introduce and validate the metric using RadJudge, a targeted ranking test suite of clinically challenging scenarios, and RadPref, a comprehensive radiologist preference benchmark, demonstrating improved agreement with expert judgment compared to existing metrics. To facilitate reproducibility and adoption, we publicly release the metric and additionally fine-tune MedGemma [13] to generate CRIMSON predictions, enabling privacy-preserving, fully local deployment for hospitals and institutions without transmitting patient reports to external APIs.

2 Related Work

Radiology Report Evaluation. Early evaluation of radiology report generation relied on general-purpose metrics such as BLEU [12] and ROUGE [9], which measure lexical overlap and are poorly aligned with clinical correctness. Radiology-specific frameworks such as CheXbert [15] and RadGraph [6] shift evaluation toward embedding-based or entity-based comparisons; however, they remain constrained by predefined label spaces or extraction pipelines and do not explicitly model variation in clinical severity across findings. RadCliQ [19] combines multiple automated metrics to approximate radiologist judgment through a composite score.
RaTEScore [25] evaluates reports using entity-aware semantic similarity to better align with clinical content. Both metrics, however, lack explicit finding-level interpretability and explanation, and do not incorporate clinical context to determine the severity of errors.

LLM-Based Evaluation Frameworks. More recently, evaluation methods have leveraged large language models (LLMs) to assess factual correctness through error categorization and counting [11, 17, 5, 8]. Compared to label- or graph-based approaches, LLM-based evaluators offer greater flexibility in identifying nuanced discrepancies. Additionally, these frameworks can produce structured rationales alongside error counts, improving interpretability and transparency of the evaluation process. However, most existing LLM-based metrics primarily focus on detecting and counting errors and do not explicitly incorporate severity-aware weighting or full clinical context [11, 17, 8, 24]. FineRadScore [5] introduces error weighting within a line-level correction framework; however, its severity assessment remains implicitly determined by the LLM without explicit clinician-defined guidelines or comprehensive patient-level context.

3 CRIMSON: Clinically-Grounded Report Evaluation

CRIMSON evaluates a candidate radiology report against a reference report by identifying and categorizing discrepancies at the finding level, weighing each error by its clinical significance, and computing a normalized score that reflects clinical preferences. GPT-5.2 [14] is used as the backbone throughout this pipeline. The framework operates in three stages: (1) finding extraction and clinical significance assignment, (2) error detection and classification, and (3) severity-aware score computation.

3.1 Finding Extraction and Clinical Significance Assignment

Given a reference report R_ref and a candidate report R_cand, CRIMSON first extracts all abnormal findings from each report.
Normal findings are excluded from evaluation because including them can introduce spurious variability due to stylistic differences across radiologists [2, 1]. Each extracted finding f is assigned a clinical significance weight w(f) based on standard radiological practice and a structured severity framework adapted from [16] with input from attending cardiothoracic radiologists. The clinical significance weight w(f) is defined as:

    w(f) = 1.0   if urgent
           0.5   if actionable, not urgent
           0.25  if not actionable, not urgent
           0.0   if expected/benign        (1)

Findings classified as urgent correspond to abnormalities requiring immediate intervention or indicating life-threatening conditions, such as a tension pneumothorax. Actionable non-urgent findings are those that alter patient management but are not immediately critical, including nodules, moderate pleural effusions, or consolidations. Non-actionable findings carry minimal clinical impact but remain worth documenting, such as a cervical rib or appropriately positioned support devices. Expected benign findings include expected or age-appropriate changes with no impact on care, such as degenerative spine changes or a tortuous aorta. We separate non-actionable from expected benign findings because reporting of expected benign findings varies substantially by radiologist style; penalizing based on them introduces unnecessary randomness. Clinical significance assignment incorporates patient context when available, including age and indication. For example, aortic calcification in a 75-year-old patient is classified as expected benign, whereas the same finding in a 25-year-old patient may be considered actionable non-urgent due to atypical early onset.
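As a concrete illustration, the context-dependent weight assignment can be sketched in Python. This is a toy sketch, not the paper's actual rubric: the rule table, the example findings, and the age-65 cutoff are illustrative assumptions.

```python
# Toy sketch of CRIMSON-style clinical-significance weighting (Eq. 1).
# The rule table and the age-65 cutoff are illustrative assumptions,
# not the clinician-defined guideline used by the paper.
SIGNIFICANCE_WEIGHTS = {
    "urgent": 1.0,                 # e.g. tension pneumothorax
    "actionable_non_urgent": 0.5,  # e.g. nodule, moderate effusion
    "non_actionable": 0.25,        # e.g. cervical rib, well-placed device
    "expected_benign": 0.0,        # e.g. degenerative spine changes
}

def significance(finding: str, age: int) -> str:
    """Context-aware category assignment: the same finding can change
    category with patient age, as in the aortic-calcification example."""
    if finding == "tension pneumothorax":
        return "urgent"
    if finding == "aortic calcification":
        return "expected_benign" if age >= 65 else "actionable_non_urgent"
    return "non_actionable"

def weight(finding: str, age: int) -> float:
    return SIGNIFICANCE_WEIGHTS[significance(finding, age)]
```

With these toy rules, `weight("aortic calcification", 75)` yields 0.0 while the same finding at age 25 yields 0.5, mirroring the context example in the text.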
3.2 Error Taxonomy and Classification

CRIMSON characterizes discrepancies through three primary error categories: false findings, missing findings, and attribute errors. False findings are abnormal findings present in R_cand but absent from R_ref, representing a hallucination. Missing findings are abnormal findings present in R_ref but absent from R_cand, representing diagnostically meaningful omissions. Findings that appear in both reports are considered “matched.” For such findings, CRIMSON evaluates attribute-level correctness across eight dimensions: (1) anatomical location or laterality, (2) severity or extent, (3) morphological descriptors, (4) quantitative measurements, (5) certainty level, (6) diagnostic underinterpretation, (7) overinterpretation, and (8) temporal or comparison descriptors. Each attribute error e is assigned a severity-based weight w_attr(e), where w_attr(e) = 0.5 if the error is labeled as significant, and w_attr(e) = 0.0 if it is labeled as negligible. Significant attribute errors are those that could alter treatment decisions or patient management, whereas negligible errors correspond to clinically inconsequential differences. For example, incorrect lung laterality is considered significant, whereas positional differences within the same lobe, such as ‘apical’ vs ‘lateral’, are considered negligible. For pulmonary nodules smaller than 6 mm, measurement discrepancies exceeding 2 mm are considered significant; for nodules 6 mm or larger, discrepancies exceeding 4 mm are considered significant, reflecting established practice [10]. Changes in severity descriptors that affect urgency, such as “small” versus “large,” are significant, whereas “small” versus “tiny” are negligible. A single matched finding may contain multiple attribute errors, each evaluated independently.
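The attribute-error weighting and the nodule measurement rule above can be expressed directly. The function names are my own; the 0.5/0.0 weights and the 2 mm / 4 mm thresholds come from the text.

```python
def measurement_error_significant(ref_mm: float, cand_mm: float) -> bool:
    """Measurement-attribute rule from the text: for nodules smaller than
    6 mm, a discrepancy over 2 mm is significant; for nodules 6 mm or
    larger, a discrepancy over 4 mm is significant."""
    diff = abs(ref_mm - cand_mm)
    return diff > 2.0 if ref_mm < 6.0 else diff > 4.0

def attr_weight(is_significant: bool) -> float:
    """Attribute errors weigh 0.5 when significant, 0.0 when negligible."""
    return 0.5 if is_significant else 0.0
```

For example, reporting a 5 mm nodule as 8 mm is significant (3 mm > 2 mm threshold), while reporting a 10 mm nodule as 13 mm is negligible (3 mm ≤ 4 mm threshold).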
3.3 Severity-Aware Scoring

The framework produces a score in the range (−1, 1] that can be easily interpreted in clinical workflows. The scale is grounded at 0, which corresponds to a normal candidate report. This reflects a practical assumption that a radiologist begins from a normal template and modifies the report by adding abnormal findings. A score greater than zero indicates that the candidate report contains more correct findings than errors after severity weighting. A score equal to zero indicates that the report is no more informative than submitting a normal template, except when the reference report is also normal, in which case a correct normal report receives a score of 1. A score less than zero indicates that the report contains more errors than correct findings, implying that a radiologist would likely spend more effort correcting it than editing a normal template. The upper bound of 1 represents a perfect report with no missed findings, no false positives, and no significant attribute errors. Negative scores approach −1 asymptotically because errors are theoretically unbounded: a candidate can always become worse by introducing additional false findings. For each matched finding m_i, let its clinical significance weight be w_i = w(m_i). Attribute-level penalties are aggregated as E_attr,i = Σ_j w_attr(e_j,i), where e_j,i refers to attribute error j for finding i, and total credit, C, across matched findings is C = Σ_{i ∈ matched} w_i · w_i / (w_i + E_attr,i). Let W_ref = Σ_{f ∈ R_ref} w(f) denote the total weighted clinical significance of the reference report, and let E_false = Σ_{f ∈ false} w(f) denote the weighted sum of false positive findings.
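Putting Section 3.3 together, a minimal dependency-free sketch of the credit aggregation, the raw score S, and the −A/(1+A) bounding of negative scores might look as follows; the data layout and function names are assumptions of the sketch, not the released implementation.

```python
def matched_credit(w: float, attr_error_weights: list) -> float:
    """Partial credit w * w / (w + E_attr,i) for one matched finding."""
    if w == 0.0:  # expected/benign findings carry no weight or credit
        return 0.0
    return w * w / (w + sum(attr_error_weights))

def crimson_score(matched, false_weights, w_ref: float) -> float:
    """`matched` pairs each matched finding's significance weight with its
    attribute-error weights; `false_weights` are weights of false findings;
    `w_ref` is the total weighted significance of the reference report."""
    credit = sum(matched_credit(w, errs) for w, errs in matched)
    e_false = sum(false_weights)
    if w_ref == 0.0:
        s = 1.0 if e_false == 0.0 else -e_false
    else:
        s = (credit - e_false) / w_ref
    if s >= 0.0:
        return s
    a = e_false - credit   # excess weighted errors over correct findings
    return -a / (1.0 + a)  # keeps negative scores in (-1, 0)
```

Under this sketch, an urgent matched finding with one significant attribute error keeps 1.0/(1.0 + 0.5) ≈ 0.67 of its credit, and a report that hallucinates one urgent finding against a normal reference scores −0.5.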
The raw score is defined as:

    S = (C − E_false) / W_ref   if W_ref > 0
        −E_false                if W_ref = 0 and E_false > 0
        1                       if W_ref = 0 and E_false = 0        (2)

To bound the negative range while preserving relative ordering, let A = E_false − C, which represents the excess weighted errors relative to correct findings. The final score is:

    CRIMSON = S              if S ≥ 0
              −A / (1 + A)   if S < 0        (3)

4 Results

We perform three complementary forms of validation: correlation with radiologist-annotated clinically significant error counts (Section 4.1), a radiologist-guided pass–fail clinical judgment test on RadJudge (Section 4.2), and large-scale radiologist preference alignment on RadPref (Section 4.3).

4.1 Correlation with Radiologist-Annotated Significant Errors

We evaluated CRIMSON on 50 cases from ReXVal [20] annotated by six board-certified radiologists. We computed Kendall’s τ and Pearson r correlations between automatic metric scores and radiologist-derived clinically significant error counts. As shown in Table 1, CRIMSON demonstrates strong alignment with these expert annotations. Furthermore, error counts (E) exhibited even stronger alignment, and severity-weighted errors (Weighted E) achieved the highest correlations overall, demonstrating that explicitly modeling clinical consequence further improves agreement with expert judgment.

Table 1: Kendall τ and Pearson r correlations (95% CI) between automatic metrics and radiologist-derived clinically significant error counts (n = 50). Columns refer to different candidate reports on ReXVal, each of which was chosen to optimize a specific metric [20]. GREEN E and CRIMSON E denote the total (unweighted) error count, while CRIMSON Weighted E applies clinical severity-based weighting to errors. All other metrics were calculated using RadEval [18].
∗CRIMSON results are averaged across 5 runs due to non-deterministic API outputs.

Candidates optimized for CheXbert and BERTScore:

                          CheXbert                                    BERTScore
Metric                    Kendall τ            Pearson r              Kendall τ            Pearson r
RadGraph [6]              0.41 [0.19, 0.61]    0.59 [0.41, 0.75]      0.54 [0.36, 0.68]    0.65 [0.51, 0.78]
BLEU [12]                 0.49 [0.30, 0.65]    0.60 [0.47, 0.72]      0.36 [0.16, 0.54]    0.48 [0.32, 0.63]
BERTScore [21]            0.52 [0.35, 0.67]    0.65 [0.52, 0.78]      0.49 [0.30, 0.66]    0.60 [0.44, 0.74]
GREEN [11]                0.62 [0.46, 0.75]    0.75 [0.64, 0.86]      0.67 [0.54, 0.78]    0.70 [0.59, 0.80]
ROUGE-L [9]               0.58 [0.44, 0.71]    0.71 [0.60, 0.81]      0.54 [0.37, 0.70]    0.62 [0.46, 0.75]
CheXbert [15]             0.46 [0.26, 0.63]    0.45 [0.18, 0.70]      0.30 [0.08, 0.51]    0.34 [0.09, 0.60]
RaTEScore [25]            0.39 [0.17, 0.57]    0.52 [0.31, 0.69]      0.49 [0.32, 0.65]    0.56 [0.37, 0.73]
RadCliQ-v1 [19]           0.34 [0.21, 0.46]    0.34 [0.19, 0.52]      0.35 [0.21, 0.48]    0.35 [0.22, 0.53]
CRIMSON∗                  0.68 [0.54, 0.79]    0.84 [0.76, 0.90]      0.71 [0.60, 0.80]    0.82 [0.74, 0.89]
GREEN E                   0.71 [0.59, 0.81]    0.75 [0.65, 0.86]      0.75 [0.63, 0.85]    0.85 [0.75, 0.92]
CRIMSON E                 0.73 [0.61, 0.83]    0.88 [0.79, 0.94]      0.72 [0.62, 0.80]    0.86 [0.77, 0.93]
CRIMSON Weighted E        0.78 [0.67, 0.86]    0.90 [0.85, 0.95]      0.80 [0.71, 0.87]    0.91 [0.88, 0.95]

Candidates optimized for RadGraph and BLEU:

                          RadGraph                                    BLEU
Metric                    Kendall τ            Pearson r              Kendall τ            Pearson r
RadGraph                  0.59 [0.46, 0.71]    0.60 [0.50, 0.72]      0.64 [0.50, 0.75]    0.75 [0.65, 0.84]
BLEU                      0.13 [0.09, 0.34]    0.23 [0.02, 0.39]      0.52 [0.34, 0.68]    0.67 [0.53, 0.79]
BERTScore                 0.46 [0.29, 0.61]    0.54 [0.39, 0.68]      0.58 [0.41, 0.72]    0.72 [0.60, 0.83]
GREEN                     0.62 [0.48, 0.74]    0.65 [0.53, 0.77]      0.70 [0.55, 0.83]    0.79 [0.68, 0.89]
ROUGE-L                   0.54 [0.38, 0.67]    0.60 [0.49, 0.70]      0.67 [0.52, 0.79]    0.80 [0.70, 0.87]
CheXbert                  0.29 [0.08, 0.48]    0.33 [0.08, 0.55]      0.18 [0.05, 0.40]    0.23 [0.04, 0.47]
RaTEScore                 0.57 [0.42, 0.70]    0.62 [0.45, 0.78]      0.54 [0.39, 0.68]    0.67 [0.54, 0.79]
RadCliQ-v1                0.12 [0.05, 0.28]    0.06 [0.11, 0.28]      0.28 [0.11, 0.43]    0.16 [0.01, 0.53]
CRIMSON∗                  0.61 [0.45, 0.75]    0.71 [0.53, 0.85]      0.67 [0.54, 0.79]    0.81 [0.71, 0.89]
GREEN E                   0.71 [0.59, 0.81]    0.75 [0.65, 0.86]      0.80 [0.71, 0.88]    0.88 [0.82, 0.93]
CRIMSON E                 0.73 [0.61, 0.83]    0.86 [0.78, 0.92]      0.74 [0.61, 0.84]    0.87 [0.78, 0.93]
CRIMSON Weighted E        0.77 [0.67, 0.85]    0.86 [0.80, 0.93]      0.78 [0.69, 0.86]    0.88 [0.82, 0.93]

4.2 Radiologist-Guided Clinical Judgment Test

Figure 2: RadJudge results.
For each case, metrics are evaluated based on whether their relative ranking of multiple candidate reports agrees with the expected ordering determined by agreement across three attending cardiothoracic radiologists. Each category contains three cases; entries are cases passed (out of 3), with totals out of 30.

Figure 3: Radiologist Preference Alignment (RadPref). Correlation between metric score and radiologist rating differences across 100 pairwise cases. Each point corresponds to a case comparing two candidate reports for the same reference report.

We also developed RadJudge, a targeted pass–fail test suite reflecting real-world radiologist intuition. RadJudge comprises 30 curated cases across 10 clinically nuanced categories in which multiple candidate reports are compared and independently reviewed by three cardiothoracic radiologists, with agreement required to establish the reference preference. A metric passes if it ranks reports in accordance with expert judgment or assigns equivalent scores when radiologists deem them clinically indistinguishable, defined as differences within a threshold of 0.01. Three representative cases are shown in Figure 1. The suite probes challenging scenarios, including urgent omissions versus benign hallucinations, context-dependent findings, diagnostic over- and under-interpretation, and situations reflecting the clinical reality of imperfect reference reports that omit localization or age-expected benign findings. As shown in Figure 2, CRIMSON is the only metric that correctly solves all 30 cases, consistently ranking candidate reports in accordance with expert radiologist judgment. In contrast, all prior metrics perform substantially worse, correctly resolving fewer than 35% of cases, highlighting their limited ability to capture nuanced clinical judgment.
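The pass criterion described for RadJudge can be sketched as a simple ordering check. The function and argument names are mine; the 0.01 equivalence threshold is taken from the text.

```python
def radjudge_pass(scores: dict, expected_order: list, tie_eps: float = 0.01) -> bool:
    """A metric passes a case if, for every consecutive pair in the
    radiologist-preferred ordering (best first), the preferred report does
    not score meaningfully below the less-preferred one; gaps smaller than
    tie_eps count as clinically indistinguishable ties."""
    for better, worse in zip(expected_order, expected_order[1:]):
        if scores[better] < scores[worse] - tie_eps:
            return False
    return True
```

For instance, scoring the preferred report 0.500 and the other 0.505 still passes, since the 0.005 gap falls below the tie threshold.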
4.3 Radiologist Preference Alignment

Correlation with error counts may not fully capture radiologist preference, as experts do not weigh different types of errors equally in overall judgment. To directly assess preference alignment, we introduce RadPref, a benchmark of 100 cases, each containing a reference report from ReXGradient-160K [22] and two candidate reports randomly generated using diverse regimes: report generation with MedGemma [13], randomly sampled reports, BERT similarity–matched reports, and LLM-based editing, addition, or removal of findings. Each candidate was rated on a 1–5 scale by three cardiothoracic radiologists based on overall clinical quality and correctness relative to the reference report. The scale was defined as: 1 = completely wrong or clinically dangerous; 2 = major errors, with most key findings missing or false; 3 = partially correct, with some significant errors; 4 = mostly accurate, with only minor or negligible errors; and 5 = clinically equivalent to the ground truth. Scores were computed separately for both candidates.

Figure 4: MedGemmaCRIMSON vs GPT-5.2. A) Mean absolute error across false findings, missing findings, and attribute errors per radiologist. B) Severity categorization confusion matrices between three radiologists and CRIMSON, computed only on matched errors (i.e., findings for which both the radiologist and CRIMSON identified an error in the same category). Titles show the percentage of cases for which the radiologist and CRIMSON agree on error category. Color intensity represents the within-row percentage.

Figure 3 shows that CRIMSON demonstrates the strongest alignment with radiologist pairwise preferences among all evaluated metrics. Across Kendall’s τ_b and Pearson r, CRIMSON consistently outperforms prior approaches and approaches inter-rater radiologist agreement. These findings indicate that CRIMSON more faithfully reflects expert clinical preference in relative report quality.
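The RadPref analysis pairs, for each case, the metric's score difference between the two candidates with the radiologists' rating difference, then correlates the two series. A dependency-free sketch (function names are mine; the paper's exact statistical procedure may differ):

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def kendall_tau(xs, ys):
    """Naive O(n^2) Kendall rank correlation (tau-a, no tie correction)."""
    n = len(xs)
    num = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            num += 1 if s > 0 else -1 if s < 0 else 0
    return num / (n * (n - 1) / 2)

def preference_alignment(score_pairs, rating_pairs):
    """Correlate metric score differences with radiologist rating
    differences across pairwise cases, as plotted in Figure 3."""
    d_score = [a - b for a, b in score_pairs]
    d_rating = [a - b for a, b in rating_pairs]
    return pearson_r(d_score, d_rating), kendall_tau(d_score, d_rating)
```

Each `score_pairs` entry holds the metric's scores for the two candidates of one case, and each `rating_pairs` entry the corresponding radiologist ratings; a metric that always prefers the report radiologists prefer yields correlations near 1.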
4.4 MedGemma Fine-tuning and Analysis

We fine-tuned MedGemma on GPT-5–generated CRIMSON annotations using the full ReXGradient-160K [22] training set of 140,000 report pairs for 10 epochs using LoRA [4]. For each pair, the candidate report was generated using the same regimes described in Section 4.3, except without any image-based generation. GPT-5 generated structured finding-level error labels and severity assignments, which served as supervision to train MedGemma to replicate CRIMSON-style outputs. Additional training details are provided on the model’s Hugging Face page. We compare the fine-tuned MedGemma (MedGemmaCRIMSON) against GPT-5.2 on RadPref preference alignment and severity categorization (Figure 4). Notably, RadPref provides not only pairwise preference ratings but also structured error categorization and clinical severity annotations across all three cardiothoracic radiologists, enabling direct evaluation of both preference alignment and clinically significant error modeling. MedGemmaCRIMSON achieves comparable mean absolute error to GPT-5.2 across false findings, missing findings, and attribute errors, with similar behavior particularly on attribute-level discrepancies. For clinical significance labeling, MedGemmaCRIMSON closely mirrors GPT-5.2 in reproducing radiologist-assigned categories across all three radiologists, achieving agreement rates that are slightly lower but within a narrow margin of GPT-5.2 (Radiologist 1: 80.3% vs 81.6%; Radiologist 2: 76.7% vs 80.5%; Radiologist 3: 73.5% vs 75.4%), with most disagreements occurring between adjacent severity levels rather than extreme misclassifications.

5 Discussion

We introduce CRIMSON, a clinically grounded and severity-aware framework for fine-grained radiology report evaluation that explicitly models patient context, diagnostic consequence, and structured attribute-level errors.
By incorporating clinician-defined clinical significance weights and score normalization, CRIMSON aligns automated evaluation more closely with real-world radiologist reasoning than prior approaches. While CRIMSON leverages GPT-5.2 for structured evaluation, we additionally demonstrate that a fine-tuned open-weight model (MedGemmaCRIMSON) can closely approximate its behavior, enabling privacy-preserving, fully local deployment. A core motivation of CRIMSON is the principle that generated reports should be evaluated according to how they would function under radiologist oversight, rather than solely through aggregate accuracy or raw error counts. Instead of treating all discrepancies equally or as binary (significant vs. not significant), CRIMSON explicitly models whether an error would be clinically consequential or potentially dangerous. Missing a life-threatening abnormality should dominate the evaluation, whereas minor descriptor differences should not. This principle also motivates CRIMSON’s partial-credit design for attribute errors. When a model correctly identifies a clinically important finding but misstates a secondary attribute (e.g., mild severity mismatch or imprecise localization), it may still provide value by directing the radiologist’s attention to the relevant abnormality. CRIMSON therefore rewards correct detection while penalizing clinically meaningful attribute mistakes in a severity-aware manner, reflecting that some errors increase downstream review effort without necessarily creating the same patient-safety risk as a complete omission or major hallucination. Across three complementary validation settings: (1) correlation with radiologist-annotated clinically significant error counts, (2) the RadJudge clinical judgment suite, and (3) the RadPref radiologist preference benchmark, CRIMSON consistently demonstrates stronger agreement with expert judgment than existing metrics.
Notably, severity-weighted modeling further improves alignment, highlighting the importance of distinguishing clinically consequential errors from benign ones.

A limitation of this work is that much of CRIMSON's prompting framework, severity rubric, and structured evaluation guidelines were developed specifically for chest X-ray reports. The clinical significance taxonomy, attribute rules, and measurement thresholds were designed in collaboration with cardiothoracic radiologists and tailored to common CXR findings and reporting conventions. Although the underlying evaluation framework is modality-agnostic in principle, applying CRIMSON to other imaging domains will require adapting prompts, finding ontologies, and severity criteria to modality-specific clinical standards. Future work will extend CRIMSON beyond chest X-ray to imaging modalities where anatomical detail, multimodal context, and diagnostic complexity are considerably greater.

Acknowledgments

This research was supported in part by the Harvard Medical School Dean's Innovation Award for Accelerating Foundation Model Research.

References

[1] Baharoon, M., Ma, J., Fang, C., Toma, A., Wang, B.: Exploring the design space of 3D MLLMs for CT report generation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 237–246. Springer (2025)
[2] Baharoon, M., Raissi, S., Jun, J.S., Heintz, T., Alabbad, M., Alburkani, A., Kim, S.E., Kleinschmidt, K., Alhumaydhi, A.O., Alghamdi, M.M.G., et al.: RadGame: An AI-powered platform for radiology education. arXiv preprint arXiv:2509.13270 (2025)
[3] Chaves, J.M.Z., Huang, S.C., Xu, Y., Xu, H., Usuyama, N., Zhang, S., Wang, F., Xie, Y., Khademi, M., Yang, Z., et al.: Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. arXiv preprint arXiv:2403.08002 (2024)
[4] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)
[5] Huang, A., Banerjee, O., Wu, K., Reis, E.P., Rajpurkar, P.: FineRadScore: A radiology report line-by-line evaluation technique generating corrections with severity scores. arXiv preprint arXiv:2405.20613 (2024)
[6] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., et al.: RadGraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463 (2021)
[7] Li, R., Li, J., Jian, B., Yuan, K., Zhu, Y.: ReEvalMed: Rethinking medical report evaluation by aligning metrics with real-world clinical judgment. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 11823–11837 (2025)
[8] Li, Y., Liu, Y., Wang, Z., Liang, X., Liu, L., Wang, L., Zhou, L.: S-RRG-Bench: Structured radiology report generation with fine-grained evaluation framework. Meta-Radiology, 100171 (2025)
[9] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
[10] MacMahon, H., Naidich, D.P., Goo, J.M., Lee, K.S., Leung, A.N., Mayo, J.R., Mehta, A.C., Ohno, Y., Powell, C.A., Prokop, M., et al.: Guidelines for management of incidental pulmonary nodules detected on CT images: from the Fleischner Society 2017. Radiology 284(1), 228–243 (2017)
[11] Ostmeier, S., Xu, J., Chen, Z., Varma, M., Blankemeier, L., Bluethgen, C., Md, A.E.M., Moseley, M., Langlotz, C., Chaudhari, A.S., et al.: GREEN: Generative radiology report evaluation and error notation. In: Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 374–390 (2024)
[12] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
[13] Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: MedGemma technical report. arXiv preprint arXiv:2507.05201 (2025)
[14] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)
[15] Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.P.: CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv preprint arXiv:2004.09167 (2020)
[16] Tian, K., Hartung, S.J., Li, A.A., Jeong, J., Behzadi, F., Calle-Toro, J., Adithan, S., Pohlen, M., Osayande, D., Rajpurkar, P.: ReFiSco: Report fix and score dataset for radiology report generation. PhysioNet (2023)
[17] Wang, Z., Luo, X., Jiang, X., Li, D., Qiu, L.: LLM-RadJudge: Achieving radiologist-level evaluation for X-ray report generation. arXiv preprint arXiv:2404.00998 (2024)
[18] Xu, J., Zhang, X., Abderezaei, J., Bauml, J., Boodoo, R., Haghighi, F., Ganjizadeh, A., Brattain, E., Van Veen, D., Meng, Z., et al.: RadEval: A framework for radiology text evaluation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 546–557 (2025)
[19] Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E.P., Fonseca, E.K.U.N., Lee, H.M.H., Abad, Z.S.H., Ng, A.Y., et al.: Evaluating progress in automatic chest X-ray radiology report generation. Patterns 4(9) (2023)
[20] Yu, F., Endo, M., Krishnan, R., Pan, I., Tsai, A., Reis, E.P., Fonseca, E., Lee, H., Shakeri, Z., Ng, A., et al.: Radiology Report Expert Evaluation (ReXVal) dataset (2023)
[21] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019)
[22] Zhang, X., Acosta, J.N., Miller, J., Huang, O., Rajpurkar, P.: ReXGradient-160K: A large-scale publicly available dataset of chest radiographs with free-text reports. In: Machine Learning for Health (2025)
[23] Zhang, X., Acosta, J.N., Yang, X., Adithan, S., Luo, L., Zhou, H.Y., Miller, J., Huang, O., Zhou, Z., Hamamci, I.E., et al.: Automated chest X-ray report generation remains unsolved. In: Biocomputing 2026: Proceedings of the Pacific Symposium, pp. 236–250. World Scientific (2025)
[24] Zhang, Z., Lee, K., Deng, W., Zhou, H., Jin, Z., Huang, J., Gao, Z., Marshall, D.C., Fang, Y., Yang, G.: GEMA-Score: Granular explainable multi-agent score for radiology report evaluation. arXiv preprint arXiv:2503.05347 (2025)
[25] Zhao, W., Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: RaTEScore: A metric for radiology report generation. arXiv preprint arXiv:2406.16845 (2024)