← Back to papers

Paper deep dive

From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation

Bangju Han, Yingqi Wang, Qing Huang, Tiyuan Li, Fengyi Yang, Ahtamjan Ahmat, Abibulla Atawulla, Ran Bi, Yating Yang, Xi Zhou

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 60

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/22/2026, 5:51:10 AM

Summary

The paper introduces CulT-Eval, a benchmark of 7,959 instances designed to evaluate machine translation performance on culture-loaded expressions (idioms, slang, CSIs). It shows that standard metrics such as BLEU and COMET fail to capture culturally induced meaning deviations, and proposes ACRE, a complementary, taxonomy-aware evaluation metric that addresses these systematic failure modes in LLMs and NMT systems.

Entities (6)

CulT-Eval · benchmark · 100%
Large Language Models · technology · 95%
Machine Translation · task · 95%
ACRE · metric · 90%
BLEU · metric · 90%
COMET · metric · 90%

Relation Signals (3)

CulT-Eval contains Culture-loaded expressions

confidence 95% · CulT-Eval comprises over 7,959 carefully curated instances spanning multiple types of culturally grounded expressions

CulT-Eval evaluates Machine Translation

confidence 95% · a benchmark designed to evaluate how models handle different types of culturally grounded expressions

ACRE outperforms BLEU

confidence 90% · Pearson and Spearman correlation coefficients between metrics and human annotations on CulT-Eval

Cypher Suggestions (2)

Find all benchmarks related to machine translation evaluation. · confidence 90% · unvalidated

MATCH (b:Benchmark)-[:EVALUATES]->(t:Task {name: 'Machine Translation'}) RETURN b

Identify metrics used to evaluate machine translation performance. · confidence 90% · unvalidated

MATCH (m:Metric)-[:EVALUATES]->(t:Task {name: 'Machine Translation'}) RETURN m

Abstract

Abstract: Culture-loaded expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond literal linguistic form. Accurately translating such expressions remains challenging for machine translation systems. Despite this, existing benchmarks remain fragmented and do not provide a systematic framework for evaluating translation performance on culture-loaded expressions. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises 7,959 carefully curated instances spanning multiple types of culturally grounded expressions, together with a comprehensive error taxonomy for such expressions. Through extensive evaluation of large language models and detailed analysis, we identify recurring and systematic failure modes that are not adequately captured by existing automatic metrics. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics. The results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at https://anonymous.4open.science/r/CulT-Eval-E75D/.

Tags

ai-safety (imported, 100%) · cs.CL (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

59,492 characters extracted from source content.


From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation

Bangju Han* (hanbangju23@mails.ucas.ac.cn), Yingqi Wang* (wangyingqi23@mails.ucas.ac.cn), Qing Huang (huangqing24@mails.ucas.ac.cn), Tiyuan Li, Fengyi Yang, Ahtamjan Ahmat, Abibulla Atawulla, Ran Bi, Yating Yang, Xi Zhou†
Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi, China; University of Chinese Academy of Sciences, Beijing, China; Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi, China
* Both authors contributed equally to this research. † Corresponding author.

Abstract. Culture-loaded expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond literal linguistic form. Accurately translating such expressions remains challenging for machine translation systems. Despite this, existing benchmarks remain fragmented and do not provide a systematic framework for evaluating translation performance on culture-loaded expressions. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises 7,959 carefully curated instances spanning multiple types of culturally grounded expressions, together with a comprehensive error taxonomy for such expressions. Through extensive evaluation of large language models and detailed analysis, we identify recurring and systematic failure modes that are not adequately captured by existing automatic metrics. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics.
The results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at https://anonymous.4open.science/r/CulT-Eval-E75D/.

CCS Concepts: • General and reference → Evaluation; • Computing methodologies → Machine translation.

Keywords: Culture-loaded expressions, Machine translation evaluation, Cultural grounding, Error analysis, Benchmark

arXiv:2603.17303v1 [cs.CL] 18 Mar 2026

1 Introduction

Rooted in shared cultural knowledge and social conventions, many expressions in natural language convey meanings that rely on implicit cultural knowledge rather than explicit linguistic forms. We refer to these as culture-loaded expressions: idioms, slang, literary allusions, and culture-specific items (CSIs). Figure 1 presents representative instances across idioms, proverbs, and culture-specific items. Recent studies indicate that cultural grounding poses persistent challenges for modern large language models. Evaluations of LLMs consistently show that these models struggle with culturally grounded expressions such as idioms and idiomatic language, whose meanings depend on shared cultural and historical background that is not explicitly encoded in the linguistic form [6, 9]. These limitations directly affect translation quality: studies on interpretative slang and culture-specific items demonstrate that accurate translation often requires reconstructing intended meaning in context, rather than relying on direct lexical mapping [16, 33].

Despite growing awareness of these model limitations, current evaluation frameworks remain poorly equipped to detect or characterize culturally grounded translation errors. Standard automatic metrics such as BLEU and ChrF mainly reward surface lexical similarity, while learned metrics such as COMET often favor conventional phrasing and penalize stylistic variation, which makes them unreliable for evaluating culturally nuanced translations [8, 17]. As a result, figurative, cultural, and context-dependent errors frequently go unnoticed. Translating idioms, resolving cultural ambiguity, or conveying implicit references across languages requires more than lexical fidelity, but these aspects are only weakly reflected in standard evaluation signals [24, 30]. For instance, literal translations of idioms that preserve surface form but lose figurative meaning may still score highly under BLEU [2], as the sketch below illustrates.
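To see this failure mode concretely, the following is a minimal sketch (not from the paper; the sentences and the sacrebleu-based scoring are illustrative) comparing sentence-level BLEU for a literal rendering of the idiom 吃软不吃硬 against a meaning-preserving paraphrase:

    from sacrebleu.metrics import BLEU

    # Reference rendering of the idiom 吃软不吃硬 ("amenable to coaxing but not coercion").
    reference = "He is amenable to coaxing but not coercion, so speak gently."

    # A word-by-word rendering that keeps surface form but loses the figurative sense,
    # and a paraphrase that preserves the meaning with different wording.
    literal_hyp = "He is amenable to eating soft but not eating hard, so speak gently."
    paraphrase_hyp = "Gentle persuasion works on him where pressure fails."

    bleu = BLEU(effective_order=True)  # smoothing appropriate for sentence-level BLEU
    for name, hyp in [("literal", literal_hyp), ("paraphrase", paraphrase_hyp)]:
        print(f"{name}: BLEU = {bleu.sentence_score(hyp, [reference]).score:.1f}")

    # The literal hypothesis shares far more n-grams with the reference and therefore
    # scores higher, even though it distorts the culture-loaded meaning.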
In response, recent work has proposed more context-sensitive and culturally informed evaluation methods, including challenge sets that isolate cases requiring pragmatic reasoning or shared cultural knowledge, as well as metrics tailored to specific phenomena such as idiomaticity, metaphor, or cultural references [28, 33]. Still, these advances remain limited in scope, since most evaluation datasets focus on isolated phenomena and lack a unified structure for taxonomy-driven analysis of culturally induced meaning deviations in translation outputs.

To address these gaps, we introduce CulT-Eval, a benchmark designed to systematically evaluate machine translation performance on culture-loaded expressions. CulT-Eval provides curated source sentences and human reference translations, together with structured annotations that enable diagnostic analysis beyond sentence-level accuracy. Crucially, CulT-Eval is coupled with a unified error taxonomy that makes culturally induced meaning deviations explicit and measurable. Building on this taxonomy, we further propose a complementary evaluation metric that operationalizes these error categories and quantifies cultural meaning preservation beyond what standard automatic metrics can capture. Together, the benchmark and the metric form a coherent evaluation framework for analyzing how translation systems handle culturally grounded meaning. Using CulT-Eval, we conduct extensive evaluations of machine translation specialist systems and large language models, revealing systematic failure patterns that are not exposed by existing benchmarks or metrics.

Figure 1: Representative CulT-Eval instances.
Term: 迷途知返 · Type: Idiom · English: retract from the wrong path
Source: 他犯了错误后迷途知返，避免了更大的损失。 Translation: After making a mistake, he promptly retracted from the wrong path, avoiding greater losses.
Term: 好马不吃回头草 · Type: Proverb · English: a good man never looks back once he has moved on
Source: 好马不吃回头草，既然他已经离开了老窝，就决心在外面的世界闯荡下去。 Translation: A good man never looks back once he has moved on; since he has left the nest, he is determined to make a living in the outside world.
Term: 站票 · Type: CSI · English: standing-room-only tickets
Source: 由于音乐会太受欢迎，我们只能买到站票。 Translation: Because the concert was so popular, we could only get standing-room-only tickets.

Our contributions are as follows:
• We present CulT-Eval, a benchmark for evaluating machine translation of culture-loaded expressions, comprising 7,959 carefully curated instances with structured coverage across diverse culturally grounded phenomena and diagnostic annotations.
• Through extensive evaluations of neural machine translation systems and large language models on CulT-Eval, we provide insights into recurring and systematic failure patterns in translating culturally grounded meaning, revealing challenges that are not exposed by existing benchmarks.
• We further find that widely used automatic evaluation metrics, such as BLEU and COMET, are insufficient for assessing culturally induced meaning deviations, motivating the use of a complementary, taxonomy-aware evaluation metric for more reliable analysis.

2 Related Work

Culture-loaded Expressions in Machine Translation. Recent research has increasingly recognized the challenge of translating culture-loaded expressions, leading to the creation of numerous specialized evaluation benchmarks. Existing resources cover a wide range of culture-loaded content, spanning idioms [15] and proverbs [29], slang [16] and social-media expressions [12], classical poetry [4] and culture-specific items [33], as well as domain-anchored terminology including recipes [3].
Despite this progress, existing resources remain fragmented along multiple axes: they are often bound to a single cultural domain or register, focus on isolated linguistic phenomena, and employ incompatible annotation schemes and evaluation criteria, which limits cross-benchmark comparability and cumulative analysis of culture-related meaning shifts in MT. These limitations motivate the need for a unified taxonomy that can serve as a shared interface to characterize culture-loaded expressions and align evaluation across domains and phenomena.

Table 1: Statistics of the CulT-Eval benchmark.

Category     Count   Ratio (%)   Avg. Term Len. (Zh chars)   Avg. Term Len. (En chars)   Avg. Context Len. (Zh)
Linguistic   2,512   31.6        5.13                        29.43                       22.29
Social       2,399   30.1        3.21                        24.22                       23.72
Material     1,594   20.0        3.06                        22.97                       24.39
Ecological   833     10.5        2.82                        20.13                       22.66
Religious    621     7.8         3.13                        23.01                       24.22
Total/Avg.   7,959   100.0       3.74                        25.09                       23.33

Evaluation Metrics in Machine Translation. Alongside fragmented resources, cultural evaluation is frequently conducted with general-purpose MT metrics that primarily capture surface overlap or holistic semantic similarity, including BLEU [19], ChrF++ [20], BERTScore [34], COMET [23], and QE-based variants [11, 13, 22]. However, cultural meaning shifts often appear as dimension-specific errors, including culture-specific referent mismatch, loss of allusive meaning, and register or socio-pragmatic mismatch, which may not be reliably reflected by aggregate scores [5, 32]. Overlap-based metrics can under-reward legitimate paraphrases or localization choices, while semantic-similarity and learned metrics may still fail to identify which cultural dimension is violated, limiting diagnostic value and impeding cross-benchmark interpretability [21, 27]. These limitations highlight the need for evaluation that is taxonomy-grounded and dimension-aware, enabling consistent and fine-grained assessment of cultural meaning preservation across diverse benchmarks.

3 CulT-Eval Benchmark

This section presents CulT-Eval, a benchmark designed to evaluate machine translation of culture-loaded expressions. We first delineate the data sources (§3.1), followed by the process of expression identification and taxonomy-based classification (§3.2). We then detail the construction of human-verified references and the associated quality control procedures (§3.3). An overview of CulT-Eval is illustrated in Figure 2.

3.1 Data Source

CulT-Eval is constructed from two major domains of culturally rich Chinese-English parallel data, selected to ensure diverse registers and translation ambiguity.

Literary and Narrative Archives. We selected bilingual excerpts from regional literature, folklore chronicles, and movie subtitles. This domain captures the expressive richness of the language, serving as the primary source for idioms, slang, and ecological/material CSIs. By including both standardized idiomatic expressions and colloquial usage, this subset presents culturally nuanced phrasing that often requires interpretation beyond literal translation, and may or may not have established equivalents in the target language.

Table 2: Pearson (r) and Spearman (ρ) correlation coefficients between metrics and human annotations on CulT-Eval.

Metric        Pearson r   Spearman ρ
BLEU          30.2        28.4
ChrF++        22.4        20.1
BERTScore     27.5        25.3
COMET         44.5        39.0
MetricX-QE    24.6        22.8
ACRE (Ours)   68.4        65.1
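Segment-level correlations like those in Table 2 pair each translation's metric score with a human judgment on the same segment; below is a minimal sketch of that computation with scipy, using made-up toy numbers rather than the paper's data:

    from scipy.stats import pearsonr, spearmanr

    # Toy data (illustrative only): one automatic-metric score and one binary human
    # cultural-correctness judgment per translated segment.
    metric_scores = [32.1, 20.3, 45.8, 12.7, 50.2, 28.9]
    human_valid   = [0, 1, 1, 0, 1, 0]

    r, _ = pearsonr(metric_scores, human_valid)
    rho, _ = spearmanr(metric_scores, human_valid)
    print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")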
Public and Institutional Communication. We aggregated data from official publicity materials, news reports, and documentaries. This category focuses on formal and communicative precision, providing a rich source for social/political CSIs and statutory terms. In contrast to the literary subset, the public/institutional subset focuses on consistent mappings of formalized terminology and policy-oriented expressions, which are typically expected to conform to existing bilingual conventions in cross-cultural communication.

3.2 Cultural Taxonomy

To facilitate fine-grained diagnostic evaluation, each instance in CulT-Eval is annotated with a cultural category. We adopt a five-way taxonomy adapted from established frameworks in translation studies and intercultural communication [1, 18]. This taxonomy systematizes the underlying cultural grounding into the following five dimensions:

(1) Material Culture: tangible artifacts, traditional attire, and architectural styles (e.g., Majiazi [马架子], a log shelter).
(2) Social Culture: sociopolitical systems, historical movements, and institutional roles (e.g., Red Tourism [红色旅游]).
(3) Linguistic Culture: idiomatic expressions, proverbs, and metaphors with non-compositional meanings (e.g., Three cobblers [三个臭皮匠]).
(4) Religious Culture: belief systems, ritualistic practices, and philosophical frameworks (e.g., Confucianism).
(5) Ecological Culture: terms rooted in seasonal cycles, calendrical systems, and geography-based cosmological concepts (e.g., Grain Rain [谷雨]).

Each instance is assigned a single primary label based on its predominant contextual function. To ensure diagnostic clarity, we prioritize mutually exclusive assignments even for expressions that exhibit categorical overlap.

3.3 Benchmark Construction Pipeline

The construction of CulT-Eval followed a semi-automated pipeline combining LLM assistance with human annotation.

LLM-Assisted Candidate Extraction. We employed GPT-5 to identify candidate sentences from raw Chinese corpora likely to contain culture-loaded expressions. Source texts spanned multiple domains, including literary works, documentaries, and movie subtitles. The model was prompted with domain-adapted instructions (see Figure 6 in Appendix A) to overgenerate potentially culture-specific content, prioritizing recall over precision.

[Figure 2: Overview of the CulT-Eval benchmark: raw-text sources (literature, documentaries, public media; 7,959 Chinese-English pairs with fields term, term_english, source_sentence, translation_sentence), the construction pipeline (culture label annotation, culture meaning annotation, quality control, span-level check), the five-dimensional cultural taxonomy, data examples such as the colloquial expression 甘为铜钱做牛马 and the CSI 喘家, and representative model translation failures, including the literalization of 穿小鞋 ("wearing small shoes" instead of "making things difficult for him") and the sense error for 地龙 ("underground earth dragons" instead of "underfloor heating flues").]
Human Annotation and Cultural Term Labeling. All GPT-extracted candidates were reviewed by trained annotators to verify the presence of at least one genuine culture-loaded expression. Sentences deemed culturally irrelevant or ambiguous were excluded. For each retained sentence, annotators identified the minimal span of the culture-specific term in both the Chinese source and the English target, and assigned a category label from the taxonomy defined in Section 3.2.

In addition to span identification and classification, annotators enriched each instance with: (1) a verified English reference translation drawn from the original bilingual source; (2) a cultural explication, a one-sentence contextual definition that explains the term's cultural or historical significance; and (3) a standardized English equivalent when applicable.

3.4 Dataset Statistics and Quality Control

After annotation, we applied post hoc filtering to ensure consistency and interpretability of the benchmark. Annotators refined span boundaries, normalized terminology, and excluded instances that did not meet the dataset criteria. Specifically, instances were removed if they exhibited: (1) insufficient cultural salience, where the expression did not encode a clear culture-dependent concept; (2) weak contextual support, where the surrounding sentence was insufficient to disambiguate meaning; or (3) semantic misalignment, where the source and target sentences showed low correspondence due to overly literal translation or structural mismatch.

English reference translations were drawn from the original bilingual sources, including subtitles, literary translations, and official publications. These translations were manually inspected to verify basic pragmatic adequacy. Instances were excluded if the English side consisted primarily of unglossed transliterations or if the translation failed to reflect the intended meaning of the culture-loaded expression.

In addition to sentence-level alignment, we annotated the corresponding spans of culture-loaded expressions on the English reference side. Each instance therefore contains an explicit mapping between the source-language cultural span and its realization in the target language, enabling span-level inspection during evaluation.

From a linguistic perspective, the dataset encompasses culture-loaded expressions realized as idioms, slang, and colloquialisms, alongside literary and poetic forms and entity-based culture-specific items. After rigorous filtering, the dataset was refined from an initial pool of approximately 12,000 candidates to the final 7,959 instances (see Table 1 for detailed statistics).
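Putting the annotation fields above together, each CulT-Eval record can be pictured as a small structured object. The sketch below is an illustrative schema of my own; the field names paraphrase the pair fields shown in Figure 2 and the annotations described in §3.3-3.4, not the released data format:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CulTEvalInstance:
        """One annotated culture-loaded translation pair (illustrative schema)."""
        term: str                  # minimal culture-loaded span in the Chinese source
        term_english: str          # its realization in the English reference
        source_sentence: str       # Chinese source sentence
        translation_sentence: str  # human-verified English reference translation
        category: str              # Linguistic, Social, Material, Ecological, or Religious
        explication: str           # one-sentence cultural/historical definition of the term
        standard_equivalent: Optional[str] = None  # standardized English equivalent, if any

    example = CulTEvalInstance(
        term="站票",
        term_english="standing-room-only tickets",
        source_sentence="由于音乐会太受欢迎，我们只能买到站票。",
        translation_sentence="Because the concert was so popular, we could only get standing-room-only tickets.",
        category="Material",  # category chosen here for illustration only
        explication="A ticket that grants entry without an assigned seat.",
    )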
4 Evaluation and Metric Analysis

This section evaluates translation models on CulT-Eval and analyzes the adequacy of commonly used evaluation metrics for culture-loaded translation. We begin with standard sentence-level metrics, and then examine their behavior on culturally salient spans.

4.1 Sentence-Level Evaluation with Standard Metrics

We evaluate a set of representative translation systems on CulT-Eval, including both multilingual NMT models and instruction-tuned large language models (LLMs) used in a zero-shot translation setting. The MT baselines include publicly released models such as NLLB-200-3.3B [26], Hunyuan-MT-7B [35], and MADLAD-400-10B [14]. In addition, we evaluate several LLM-based systems, including GPT-5.1 [25], Llama-3.1-Instruct [10], the Qwen3-Instruct series [31], and DeepSeek-V3 [7]. All systems are evaluated in a source-only setting, where models generate English translations directly from Chinese source sentences. For LLMs, we employ two prompting paradigms: (i) a vanilla zero-shot translation prompt, and (ii) a one-shot prompt containing a single illustrative example. To ensure a rigorous evaluation, the one-shot example is held out from the test set, and no supplementary cultural explications or reference translations are provided at inference time. NMT systems are evaluated under their standard inference settings without prompt variations.

[Table 3: Comprehensive evaluation results across five metrics (BLEU, ChrF++, BERTScore, COMET, MetricX-QE) under 0-shot and 1-shot settings for three MT models (Hunyuan-MT-7B, Madlad400-10B-MT, NLLB-200-3.3B) and six open-source and proprietary LLMs (Llama-3.1-8B-Instruct, DS-R1-D-Qwen-7B, Qwen3-8B-Instruct, Qwen3-32B-Instruct, DeepSeek-v3, GPT-5.1); the numeric scores did not survive text extraction.]

4.2 Sentence-level Metrics under Cultural Evaluation

As Table 3 shows, sentence-level metrics distinguish translation systems on CulT-Eval, with both NMT and LLM-based models achieving competitive scores and clear performance differences. However, these metrics do not explicitly evaluate whether culturally salient content is correctly preserved, which is central to the CulT-Eval task.

To assess whether sentence-level metrics reflect cultural correctness, we analyze their alignment with human judgments, defined as whether the annotated culture-loaded span is correctly expressed in the translation. Table 2 reports segment-level correlations between automatic metrics and human judgments of cultural correctness. Across metrics, BLEU, ChrF++, BERTScore, and COMET exhibit weak and unstable correlations, indicating that higher sentence-level scores do not reliably correspond to correct translation of culture-loaded spans.

Representative examples in Table 4 further illustrate this misalignment. We observe translations that receive high sentence-level scores despite literalization, over-generalization, or omission of cultural meaning, as well as translations that accurately convey cultural meaning through paraphrasing or explicitation but receive low scores due to surface-level divergence from the reference.

Table 4: Performance analysis (score: BLEU / COMET).

Type: High Error · BLEU 32.1 / COMET 0.83 · Valid: ✗
SRC: …强调自己吃软不吃硬，如果…
REF: …amenable to coaxing but not coercion…
SYS: …better to persuasion than force…

Type: Low Good · BLEU 20.3 / COMET 0.64 · Valid: ✓
SRC: …典型的关系户问题，在一些…
REF: …typical issue of nepotism, which…
SYS: …typical case of nepotism driven by…

Together, these results demonstrate that sentence-level similarity is an unreliable proxy for cultural correctness. While standard metrics capture overall translation quality, they fail to reflect whether culturally salient meaning is preserved, motivating a more fine-grained analysis of cultural translation errors.

4.3 A Taxonomy of Culture-related Translation Errors

Beyond the empirical observations in Section 4.2, which suggest a systematic misalignment between metrics and cultural correctness, we conduct a structured error analysis to characterize the distortion of cultural meaning.
We introduce a taxonomy of culture-related translation errors to formalize these failure modes. By distinguishing specific types of cultural attrition, our taxonomy provides a rigorous framework for evaluating translation quality where conventional metrics fail.

4.3.1 Annotation Principle. A key challenge in analyzing cultural translation errors is that multiple error phenomena may co-occur within a single instance. To ensure consistency, we assign a primary error label according to a fixed priority order, ranging from omission to over-interpretation. This ordering reflects a progression from the complete absence of cultural realization to distorted or excessive realization, and ensures that each instance is associated with the most fundamental source of cultural failure (a first-match sketch of this rule appears after the category list below). Detailed definitions and examples are provided in the Appendix.

4.3.2 Error Categories. We identify seven recurrent error categories that capture distinct ways in which cultural meaning may fail to be correctly realized in translation. All categories refer specifically to errors in the translation of culture-loaded spans, rather than to general translation errors.

(1) Omission: the culture-loaded span is not realized in the translation, either through deletion or replacement with an empty or vacuous expression.
(2) Literalization: the translation preserves surface meaning through word-by-word rendering but fails to activate the conventional or idiomatic cultural sense.
(3) Sense Error: an incorrect sense or referent is selected, resulting from misinterpretation rather than deliberate cultural substitution.
(4) Neutralization: the functional meaning is broadly conveyed, but culturally specific features are flattened into generic expressions, weakening cultural salience.
(5) Mis-substitution: a target-culture analogue is used as a replacement, but the analogy is misleading or non-equivalent.
(6) Pragmatic Shift: social or interactional meaning is altered, including changes in politeness, honorifics, or perceived social relations.
(7) Over-interpretation: additional cultural explanations, background information, or value judgments are introduced beyond what is explicit in the source.
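Since only the endpoints of the priority order are stated explicitly (omission first, over-interpretation last), the sketch below assumes the seven categories are ranked in the order they are listed above; the first-match rule itself is a direct reading of §4.3.1:

    # Priority order for primary-error assignment, from complete absence of cultural
    # realization to distorted or excessive realization. The endpoints are fixed by
    # the paper; the ordering of the middle categories follows the listing above.
    ERROR_PRIORITY = [
        "Omission",
        "Literalization",
        "Sense Error",
        "Neutralization",
        "Mis-substitution",
        "Pragmatic Shift",
        "Over-interpretation",
    ]

    def primary_error(detected: set) -> str:
        """Return the highest-priority error among those detected, or 'None' if clean."""
        for label in ERROR_PRIORITY:
            if label in detected:
                return label
        return "None"

    # An instance flagged with both neutralization and literalization receives the
    # more fundamental label, "Literalization".
    print(primary_error({"Neutralization", "Literalization"}))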
4.3.3 Error Distributions across Systems. To illustrate how the proposed taxonomy manifests in practice, we examine the distribution of cultural correctness and error types across a set of representative translation systems, all evaluated under the same source-only setting described in Section 4.

[Figure 3: Performance analysis of six selected models (DeepSeek-V3, GPT-5.1, Qwen3-7B, Llama-3.1, MADLAD-400, NLLB-200). The left chart displays the overall Cultural Correctness score (valid rate); the right chart visualizes the distribution of the seven error types within the incorrect samples.]

Figure 3 presents the cultural correctness rate and the composition of primary error categories for each system. The results show that cultural failures are systematic and vary in composition across systems, even when overall translation quality at the sentence level appears competitive. Importantly, these differences arise from how culturally loaded spans are realized, such as through omission, literalization, neutralization, pragmatic distortion, or over-interpretation, rather than from fluency or grammaticality on the surface.

4.4 Structural Limitations of Sentence-level Evaluation for Cultural Correctness

Taken together, the analyses reveal a structural mismatch between sentence-level evaluation and cultural correctness. Sentence-level metrics assume that overall similarity to a reference reflects meaning preservation, an assumption that breaks down when culturally salient meaning is realized through specific spans that may not substantially affect surface form.

The error taxonomy makes this mismatch explicit by distinguishing failure modes with different interactions with surface similarity. While omission errors remove content and are therefore often penalized, errors such as literalization, neutralization, and over-interpretation frequently preserve lexical or semantic overlap while distorting culturally salient meaning, allowing affected translations to receive high sentence-level scores.

[Table 5: Metric sensitivity analysis across lexical metrics (BLEU, ChrF) and semantic metrics (BERTScore, COMET); ✓ sensitive, ✗ insensitive, △ partial. The recovered cells indicate that omission is penalized (✓) while literalization, neutralization, and over-interpretation largely evade the lexical metrics (✗), with only partial (△) sensitivity to neutralization on the semantic side; the remaining cells were lost in extraction.]

Table 5 shows that this asymmetry is systematic. N-gram-based metrics consistently penalize omission but are largely insensitive to error types that preserve surface overlap, while embedding-based metrics exhibit only limited improvements and remain unreliable for detecting several prevalent cultural error categories.

These findings indicate that the limitation of sentence-level metrics stems from the evaluation objective itself rather than from metric design. By aggregating similarity at the sentence level, different cultural failure mechanisms are conflated into a single score, motivating the need for an evaluation approach that explicitly targets the realization of culture-loaded spans. In the next section, we introduce ACRE to operationalize this perspective.

5 ACRE

We propose ACRE (Anchored Cultural Realization Evaluation), an automatic evaluation metric for assessing whether culturally salient meaning is correctly realized in translation. Unlike sentence-level similarity metrics that rely on surface overlap, ACRE is explicitly anchored by Cultural Explication annotations, which provide ground-truth definitions of culture-loaded spans and frame cultural evaluation as a structured verification problem rather than open-ended judgment.

ACRE models cultural realization through two complementary components: Validity and Quality. Validity determines whether the intended cultural referent is correctly instantiated, while Quality assesses how appropriately that meaning is expressed.
Formally, let S denote the source sentence, H the hypothesis translation, E the associated Cultural Explication, and C the category of the culture-loaded span. ACRE is defined as:

    ACRE(S, H, E, C) = I_valid(H, E) · Φ_quality(H, S, C)    (1)

Here, I_valid(H, E) ∈ {0, 1} is the Semantic Validity Indicator, computed by the Semantic Validator, which verifies whether the hypothesis instantiates the cultural referent defined in E. Translations that fail this check are assigned a score of zero, preventing fluent but semantically incorrect realizations from being rewarded. The Quality component Φ_quality is computed only for valid instances:

    Φ_quality(H, S, C) = α_C · S_fidelity(H, S) + β_C · S_clarity(H)    (2)

where S_fidelity is the Fidelity Score, assessing the preservation of intended meaning and pragmatic force, and S_clarity is the Clarity Score, assessing communicative intelligibility for target-language readers.

5.1 Category-conditioned Evaluation and Reference Implementation

Based on the taxonomy introduced in Section 4, ACRE adopts category-conditioned protocols that determine how Quality is assessed, while leaving the metric definition unchanged. A Category Check first routes each instance to one of two evaluation protocols according to its category C.

For Protocol A (Fact-centric), which applies to categories such as specific cultural concepts, material artifacts, and social institutions, evaluation emphasizes referential validity. Under this protocol, strict alignment with the Cultural Explication is required, and deviations in referential identity directly invalidate the translation via the Semantic Validator.

For Protocol B (Style-centric), which applies to idioms, literary expressions, and slang, evaluation emphasizes functional equivalence. Under this protocol, paraphrasing or re-expression is permitted as long as pragmatic force and register are preserved.

These category-conditioned protocols are instantiated in a multi-agent evaluation framework, termed CulT-Agent, as illustrated in Figure 4. The framework operationalizes the Validity and Quality components of ACRE through coordinated agents that correspond directly to the metric formulation.

Within CulT-Agent, Validity is assessed by the Semantic Validator. For instances that pass Validity, Quality is assessed by two complementary agents: the Fidelity Critic, which computes the Fidelity Score, and the Nuance Critic, which computes the Clarity Score. These agents implement the Quality function defined in Eq. (2). Throughout evaluation, reference translations are used only as stylistic anchors rather than semantic ground truth, while Cultural Explications serve as the authoritative basis for determining cultural correctness. Detailed descriptions of individual agents and their coordination protocols are provided in the Appendix.
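Read as code, Eqs. (1)-(2) gate a category-weighted quality score behind a binary validity check. The sketch below is a minimal rendering of that structure, not the paper's implementation: the lambda judges stand in for the LLM-based Semantic Validator, Fidelity Critic, and Nuance Critic, and the per-category weights (α_C, β_C) are invented placeholders:

    from typing import Callable, Dict, Tuple

    def acre_score(
        source: str,
        hypothesis: str,
        explication: str,
        category: str,
        is_valid: Callable[[str, str], bool],     # Semantic Validator: I_valid(H, E)
        fidelity: Callable[[str, str], float],    # Fidelity Critic: S_fidelity(H, S) in [0, 1]
        clarity: Callable[[str], float],          # Nuance Critic: S_clarity(H) in [0, 1]
        weights: Dict[str, Tuple[float, float]],  # per-category (alpha_C, beta_C)
    ) -> float:
        """ACRE(S, H, E, C) = I_valid(H, E) * (alpha_C * S_fidelity + beta_C * S_clarity)."""
        if not is_valid(hypothesis, explication):
            return 0.0  # validity gate: fluent but referentially wrong output scores zero
        alpha, beta = weights[category]
        return alpha * fidelity(hypothesis, source) + beta * clarity(hypothesis)

    score = acre_score(
        source="由于音乐会太受欢迎，我们只能买到站票。",
        hypothesis="Because the concert was so popular, we could only get standing-room-only tickets.",
        explication="站票: a ticket without an assigned seat, sold once seating is exhausted.",
        category="Material",
        is_valid=lambda h, e: "standing" in h,  # toy stand-in for the LLM validator
        fidelity=lambda h, s: 0.9,              # toy stand-in for the Fidelity Critic
        clarity=lambda h: 1.0,                  # toy stand-in for the Nuance Critic
        weights={"Material": (0.6, 0.4)},       # assumed weights, not from the paper
    )
    print(f"ACRE = {score:.2f}")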
5.2 Implementation Details

In our experiments, ACRE is instantiated through CulT-Agent, which realizes the Semantic Validator, Fidelity Critic, and Nuance Critic using a large language model as a constrained judge. Each agent executes a specific functional role defined by the category-conditioned protocols.

[Figure 4: The CulT-Agent evaluation pipeline: the input (source S, hypothesis H, explication E, category C) is routed by a category check to Protocol A or Protocol B, passed through the Semantic Validator (output in {0, 1}) and, if valid, the Fidelity and Nuance Critics, yielding per-system scores.]

All evaluations are implemented with grok-4.1-fast, using deterministic decoding and fixed prompts. The prompts instantiate the roles of individual agents and their coordination protocol, and are held constant across all experiments to ensure reproducibility. For transparency, all prompt templates and agent specifications are provided in Appendix A.

6 Experiments and Analyses

6.1 Cultural Correctness under ACRE

We evaluate ACRE using the same set of translation systems and experimental configuration described in Section 4.1. Table 6 reports ACRE scores and their components for representative machine translation systems and large language models on CulT-Eval.

The results indicate that the realization of culture-loaded expressions remains a challenging and unreliable aspect of current translation systems. Across models, validity failures are frequent, suggesting that culture-loaded expressions are often misinterpreted, generalized, or omitted, even when translations appear fluent at the sentence level. Such failures directly limit final ACRE scores and reveal deficiencies that are obscured by sentence-level similarity metrics.

While stronger language models achieve higher validity rates than machine translation systems, these improvements are incremental rather than definitive. Correct handling of culture-loaded expressions is not consistently guaranteed, and substantial variation persists in how cultural meaning is preserved and expressed once referential correctness is satisfied. Importantly, these differences are not driven by surface-level similarity or grammaticality, but by how systems handle culture-loaded expressions at both the referential and pragmatic levels. By decomposing cultural correctness into semantic validity and realization quality, ACRE provides a more informative view of translation behavior than aggregate sentence-level metrics, which conflate distinct sources of cultural failure.

[Table 6: Comprehensive ACRE evaluation results across proprietary and open-source models, comparing Stage I semantic validity and Stage II quality profiling (fidelity and clarity) and the final ACRE score under zero-shot and one-shot settings, for the same systems as Table 3; the numeric values did not survive text extraction.]

6.2 Ablation Study

To verify the contribution of each component in ACRE, we conduct an ablation study on the translations generated by Qwen3-8B. Table 7 reports the correlation results. Removing the semantic validity gate leads to a drop in human alignment, confirming its role in filtering hallucinated but fluent translations. The most significant degradation occurs when excluding Cultural Explications, demonstrating that external semantic grounding is more critical for cultural evaluation than internal parametric knowledge alone. In contrast, the absence of reference translations causes only a marginal performance decline, indicating that ACRE relies more on semantic definitions than on surface-form similarity. These results underscore that ACRE's components provide complementary diagnostic power, with explications serving as the primary anchor for cultural correctness.

Table 7: Ablation study of ACRE.

Configuration                       Correlation (r)   Δr
Full ACRE                           0.88              –
w/o Validity Gate (I_valid)         0.76              -0.12
w/o Adaptive Routing (Protocols)    0.81              -0.07
w/o Explication Anchor (E)          0.62              -0.26
w/o Reference Anchor (R)            0.85              -0.03

6.3 Diagnostic Sensitivity to Cultural Error Types

Figure 5 compares the sensitivity of ACRE and sentence-level metrics to representative cultural error types in the translation of culture-loaded expressions.

[Figure 5: Sensitivity of evaluation metrics to cultural translation errors, measured as score drop (%) under literalization (-62%), sense error (-82%), and neutralization (-48%) for ACRE (ours), with COMET and BLEU shown for comparison.]
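The score-drop notion behind Figure 5 can be computed by scoring each segment's correct realization and an error-injected counterpart, then averaging the relative drop per error type; the following is a minimal sketch of my own, with toy numbers rather than the paper's data:

    from statistics import mean

    def sensitivity(clean_scores, error_scores):
        """Mean relative score drop (%) when a given cultural error is injected."""
        drops = [(c - e) / c * 100.0 for c, e in zip(clean_scores, error_scores) if c > 0]
        return mean(drops)

    # Toy numbers: a metric's scores on correct realizations vs. the same segments
    # with literalization errors injected.
    clean = [0.80, 0.75, 0.90]
    literalized = [0.30, 0.25, 0.40]
    print(f"Score drop on literalization: {sensitivity(clean, literalized):.0f}%")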
ACRE exhibits substantial score drops for sense errors, literalization, and neutralization, indicating strong responsiveness to failures that distort cultural meaning while preserving surface overlap. In contrast, COMET shows only moderate sensitivity, and BLEU remains largely insensitive across all error types, with minimal score variation even when cultural meaning is severely distorted. These results demonstrate that ACRE captures diagnostic signals that are systematically missed by sentence-level similarity metrics, directly explaining the discrepancies observed in Section 6.1.

7 Conclusion

In this paper, we introduced CulT-Eval, a large-scale benchmark for evaluating machine translation of culture-loaded expressions, together with a unified cultural taxonomy and fine-grained error annotations. Through extensive evaluation, we showed that widely used sentence-level metrics fail to reliably reflect cultural correctness, often overlooking systematic meaning distortions. To address this gap, we proposed ACRE, a taxonomy-aware evaluation metric anchored in cultural explications, which demonstrates substantially stronger alignment with human judgments and higher diagnostic sensitivity to culture-related errors. Our findings highlight the limitations of surface-level evaluation and underscore the need for culturally grounded assessment frameworks. We hope CulT-Eval and ACRE will facilitate more reliable evaluation and foster future research on culturally aware machine translation.

References

[1] Javier Franco Aixela. 1999. Culture-specific Items in Translation. Multilingual Matters, Bristol/Blue Ridge Summit, 52–78. doi:10.21832/9781800417915-005
[2] Christos Baziotis, Prashant Mathur, and Eva Hasler. 2023. Automatic Evaluation and Analysis of Idioms in Neural Machine Translation. In Proceedings of EACL 2023, Dubrovnik, Croatia, 3682–3700. doi:10.18653/v1/2023.eacl-main.267
[3] Yong Cao, Yova Kementchedjhieva, Ruixiang Cui, Antonia Karamolegkou, Li Zhou, Megan Dare, Lucia Donatelli, and Daniel Hershcovich. 2024. Cultural Adaptation of Recipes. Transactions of the Association for Computational Linguistics 12, 80–99. doi:10.1162/tacl_a_00634
[4] Andong Chen, Lianzhang Lou, Kehai Chen, Xuefeng Bai, Yang Xiang, Muyun Yang, Tiejun Zhao, and Min Zhang. 2025. Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance. In Proceedings of EMNLP 2025, Suzhou, China, 33019–33036. doi:10.18653/v1/2025.emnlp-main.1678
[5] Shanbo Cheng et al. 2025. Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters. arXiv:2507.13618 [cs.CL].
[6] Francesca De Luca Fornaciari, Begoña Altuna, Itziar Gonzalez-Dios, and Maite Melero. 2024. A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models. In Proceedings of FigLang 2024, Mexico City, Mexico, 35–44. doi:10.18653/v1/2024.figlang-1.5
[7] DeepSeek-AI et al. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL].
[8] Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. 2022. Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust. In Proceedings of WMT 2022, 46–68.
[9] Yicheng Fu, Zhemin Huang, Liuxin Yang, Yumeng Lu, and Zhongdongming Dai. 2025. CHENGYU-BENCH: Benchmarking Large Language Models for Chinese Idiom Understanding and Use. In Proceedings of EMNLP 2025, Suzhou, China, 2355–2366. doi:10.18653/v1/2025.emnlp-main.119
[10] Aaron Grattafiori et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI].
[11] Nuno M. Guerreiro, Ricardo Rei, Daan van Stigt, Luísa Coheur, Pierre Colombo, and André Martins. 2023. xCOMET: Transparent Machine Translation Evaluation through Fine-grained Error Detection. Transactions of the Association for Computational Linguistics 12, 979–995.
[12] Hongcheng Guo, Fei Zhao, Shaosheng Cao, Xinze Lyu, Ziyan Liu, Yue Wang, Boyang Wang, Zhoujun Li, Chonggang Lu, Zhe Xu, and Yao Hu. 2025. Redefining Machine Translation on Social Network Services with Large Language Models. arXiv:2504.07901 [cs.CL].
[13] Juraj Juraska, Mara Finkelstein, Daniel Deutsch, Aditya Siddhant, Mehdi Mirzazadeh, and Markus Freitag. 2023. MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task. In Proceedings of WMT 2023.
[14] Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2023. MADLAD-400: A Multilingual And Document-Level Large Audited Dataset. arXiv:2309.04662 [cs.CL].
[15] Shuang Li, Jiangjie Chen, Siyu Yuan, Xinyi Wu, Hao Yang, Shimin Tao, and Yanghua Xiao. 2023. Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing Idiomatic Translation with Language Models. arXiv:2308.13961 [cs.CL].
[16] Yunlong Liang, Fandong Meng, Jiaan Wang, and Jie Zhou. 2025. SlangDIT: Benchmarking LLMs in Interpretative Slang Translation. arXiv:2505.14181 [cs.CL].
[17] Ananya Mukherjee, Saumitra Yadav, and Manish Shrivastava. 2025. Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting. In Proceedings of COLING 2025, Abu Dhabi, UAE, 4779–4788.
[18] Peter Newmark. 1988. Approaches to Translation. Prentice Hall.
[19] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL 2002.
[20] Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of WMT 2017, Copenhagen, Denmark, 612–618. doi:10.18653/v1/W17-4770
[21] Lorenzo Proietti, Stefano Perrella, and Roberto Navigli. 2025. Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress. arXiv:2506.19571 [cs.CL].
[22] Ricardo Rei, José G. C. de Souza, Duarte M. Alves, Chrysoula Zerva, Ana C. Farinha, T. Glushkova, Alon Lavie, Luísa Coheur, and André F. T. Martins. 2022. COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task. In Proceedings of WMT 2022.
[23] Ricardo Rei, Craig Alan Stewart, Catarina Farinha, and Alon Lavie. 2020. Unbabel's Participation in the WMT20 Metrics Shared Task. In Proceedings of WMT 2020.
[24] Sheikh Shafayat, Dongkeun Yoon, Woori Jang, Jiwoo Choi, Alice Oh, and Seohyon Jung. 2025. A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls. arXiv:2412.01340 [cs.CL].
[25] Aaditya Singh et al. 2025. OpenAI GPT-5 System Card. arXiv:2601.03267 [cs.CL].
[26] NLLB Team, Marta R. Costa-jussà, et al. 2022. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv:2207.04672 [cs.CL].
[27] Yanzhi Tian, Cunxiang Wang, Zeming Liu, Heyan Huang, Wenbo Yu, Dawei Song, Jie Tang, and Yuhang Guo. 2026. Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation.
[28] Bin Wang, Zhengyuan Liu, Xin Huang, Fangkai Jiao, Yang Ding, Ai Ti Aw, and Nancy F. Chen. 2024. SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning. In Proceedings of NAACL 2024.
[29] Minghan Wang, Viet-Thanh Pham, Farhad Moghimifar, and Thuy-Trang Vu. 2025. Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model. arXiv:2501.11953 [cs.CL].
[30] Rachel Wicks and Matt Post. 2023. Identifying Context-Dependent Translations for Evaluation Set Production. arXiv:2311.02321 [cs.CL].
[31] An Yang et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL].
[32] Cai Yang, Yao Dou, David Heineman, Xiaofeng Wu, and Wei Xu. 2025. Evaluating LLMs on Chinese Idiom Translation. arXiv:2508.10421 [cs.CL].
[33] Binwei Yao, Ming Jiang, Tara Bobinac, Diyi Yang, and Junjie Hu. 2024. Benchmarking Machine Translation with Cultural Awareness. In Findings of EMNLP 2024, Miami, Florida, USA, 13078–13096. doi:10.18653/v1/2024.findings-emnlp.765
[34] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating Text Generation with BERT. arXiv:1904.09675 [cs.CL].
[35] Mao Zheng, Zheng Li, Bingxin Qu, Mingyang Song, Yang Du, Mingrui Sun, and Di Wang. 2025. Hunyuan-MT Technical Report. arXiv:2509.05209 [cs.CL].

A Prompt Templates

Prompt Template: Culture-Mining Agent (Data Extraction)

System Instruction: You are a bilingual data mining expert. Your task is to extract "Culture-Loaded Sentence Pairs" from the provided raw bilingual text.
IMPORTANT CONSTRAINTS:
• You must identify sentences containing Idioms, Slang, or Culture-Specific Items (e.g., history, food, traditions).
• You must extract both the Chinese source and the English translation.
• Ignore common sentences that lack specific cultural depth.
• Output must be a strictly valid JSON list.
User Input: [User] Extract cultural pairs from the following text chunk:
Raw Text: RAW_TEXT_CHUNK
Return format: [ {"src": "Chinese sentence...", "tgt": "English sentence...", "focus_term": "The specific cultural word"} ]

Figure 6: The prompt template for the Culture-Mining agent.

Prompt Template: Fine-grained Taxonomy Classifier

System Instruction: You are a Cultural Linguist. Your task is to classify a specific Chinese term into one of the Five Cultural Categories.
Taxonomy Definitions:
(1) Ecological Culture: Terms related to specific animals, plants, geography, climate, or natural phenomena unique to the region (e.g., 梅雨, 熊猫, 黄河).
(2) Material Culture: Terms related to food, clothing, architecture, artifacts, or daily necessities (e.g., 旗袍, 饺子, 炕, 四合院).
(3) Social Culture: Terms related to institutions, titles, festivals, customs, history, or social hierarchy (e.g., 高考, 春节, 尚书, 关系户).
(4) Religious Culture: Terms related to beliefs, mythology, philosophy (Confucianism, Taoism, Buddhism), or taboos (e.g., 阴阳, 菩萨, 玉帝).
(5) Linguistic Culture: Terms involving idioms (Chengyu), metaphors, slang, proverbs, or witticisms (e.g., 吃软不吃硬, 摸鱼, 破防).
User Input: [User] Context Sentence: SOURCE_SENTENCE Target Term: FOCUS_TERM
Task:
1. Analyze the meaning of the term in context.
2. Assign strictly ONE category from the list above.
3. Provide a short explanation.
Return Format (JSON): {"term": "FOCUS_TERM", "category": "Social Culture", "reason": "It refers to a specific historical government position."}

Figure 7: Prompt template for fine-grained cultural classification.

[Table 8: Comparison of CulT-Eval with existing culture-related translation benchmarks (CHENGYU-BENCH, IdioTS, SlangDIT, CAMT, IdiomEval, CulT-Eval) along seven axes: multiple expression types, culture-specific meaning, figurative or non-literal meaning, context-dependent interpretation, explicit error taxonomy, error-level analysis, and evaluation beyond surface metrics (✓: supported, ✗: not supported, △: partially supported); the per-cell marks were largely lost in extraction.]

Prompt Template: Core Dispatcher (Protocol Routing)

System Instruction: You are a taxonomy expert in cultural linguistics. Your task is to classify the "Cultural Category" of a specific term within a source sentence.
IMPORTANT DEFINITIONS:
• Protocol A (Fact-Centric):
  – Specific Concepts: Concrete institutions, historical artifacts, unique objects, technical items.
  – Proper Nouns: Names of unique places, people, organizations, festivals.
  – Core Logic: Referential Precision. The translation must point to the EXACT same entity.
  – Tie-Breaker: If a term has a specific history/material existence, choose A (e.g., "Forbidden City").
• Protocol B (Style-Centric):
  – Figurative Language: Idioms (Chengyu), metaphors, allegories.
  – Slang/Pop Culture: Buzzwords, internet memes, dialect words used for effect.
  – Core Logic: Pragmatic Equivalence. Imagery and tone are more important than literal words.
  – Tie-Breaker: If the term describes a situation/feeling rather than an object, choose B (e.g., "eating vinegar").
User Input: [User] Analyze the cultural term inside the brackets "[]" in the source sentence.
[Input] Source: SOURCE_WITH_BRACKETS
[Task] Classify the term based on its primary function in this specific context. Return ONLY one label: "Protocol A" or "Protocol B".

Figure 8: The Core Dispatcher prompt. It routes instances to the correct evaluation protocol (Fact-Centric or Style-Centric) based on linguistic features and functional context.

Prompt Template: Stage I - Semantic Validator (The Gate)

System Instruction: You are a strict Semantic Validator. Your ONLY goal is to detect "Hallucinations" (Error Type B3) or "Severe Mis-substitutions" (Error Type B5).
CRITICAL GROUND TRUTH RULE:
• You must rely EXCLUSIVELY on the provided [Cultural Explication].
• If the Explication says X, and the Model translates it as Y (where Y != X), it is INVALID.
• Ignore your own internal knowledge if it conflicts with the Explication.
NEGATIVE CONSTRAINTS (What NOT to check):
• DO NOT check for fluency, grammar, or style.
• DO NOT check for "neutralization" (loss of flavor). A boring but factually correct translation is VALID.
• DO NOT penalize literal translations here, provided they refer to the correct concepts.
User Input: [User] [Data] Protocol: PROTOCOL_LABEL Source: SOURCE Cultural Term: TERM Cultural Explication (Ground Truth): EXPLICATION Model Hypothesis: HYPOTHESIS
[Task]
1. Ignore whether the translation is elegant.
2. Check ONLY if the semantic meaning matches the Explication.
3. If the hypothesis invents a new entity (Hallucination) or refers to a completely wrong concept, mark INVALID.
[Output format] Reasoning: [Brief analysis] Decision: VALID or INVALID

Figure 9: The Semantic Validator prompt. It acts as a strict "Validity Gate" (I_valid), filtering out hallucinations by anchoring evaluation to the Cultural Explication.

Prompt Template: Stage II-A - Fidelity Critic (Nuance Analysis)

System Instruction: You are a Translation Critic focusing on "Fidelity" and "Pragmatic Equivalence". Your job is to score how well the Model Hypothesis captures the intended nuance and force, strictly according to the active Protocol.
REFERENCE USAGE RULE:
• The [Reference Anchor] is just ONE possible translation.
• DO NOT penalize the hypothesis for using different words than the reference.
• DO judge based on whether the *meaning* and *effect* are equivalent.
DYNAMIC GUIDELINE (Protocol-Dependent): DYNAMIC_INSTRUCTION
SCORING SCALE:
• 5 (Perfect): Captures full nuance, tone, and imagery.
• 3 (Acceptable): Core meaning present, but significant nuance lost (e.g., too generic).
• 1 (Failure): Severe mistranslation or complete loss of meaning.
User Input: [User] [Data] Protocol: PROTOCOL_LABEL Source: SOURCE Reference Anchor: REFERENCE Model Hypothesis: HYPOTHESIS
[Task] Evaluate Fidelity (1-5). Focus: Does the hypothesis capture the PROTOCOL_LABEL constraints?
[Output format] Reasoning: [Critique] Score: [1-5]

Figure 10: The Fidelity Critic prompt. It dynamically adjusts criteria (e.g., penalizing literalism in Protocol B vs. generalization in Protocol A) to assess translation nuance.

Prompt Template: Stage II-B - Clarity Critic (Communicative Intelligibility)

System Instruction: You are a Target Audience Evaluator simulating an English native reader who has NO prior knowledge of Chinese culture. Your goal is to assess "Communicative Intelligibility".
SCORING PHILOSOPHY (Thick Translation):
• Bonus (+): Reward **Explicitation** (e.g., in-text glosses, brief explanations, appositives like "Kang, a heated brick bed"). Reward **Transparency** (rephrasing for clarity).
• Penalty (-): Penalize **Opaque Terms** (Pinyin without context). Penalize **Confusion** (if the reader would ask "What does that mean?").
SCORING SCALE:
• 5 (Crystal Clear): Seamlessly bridged. The naive reader fully understands.
• 3 (Gist Only): Reader gets the general idea but misses the cultural specifics.
• 1 (Incomprehensible): Complete communication breakdown.
User Input: [User] [Data] Source: SOURCE Model Hypothesis: HYPOTHESIS
[Task] Evaluate Clarity (1-5) for a non-Chinese reader. Did the translator build a bridge for the reader, or leave them confused?
[Output format] Reasoning: [Analysis] Score: [1-5]

Figure 11: The Communicator prompt. It rewards explicitation strategies consistent with Appiah's "Thick Translation" theory.