Paper deep dive
What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/22/2026, 6:11:54 AM
Summary
The paper introduces MultiTempBench, a multilingual temporal reasoning benchmark covering 15,000 examples across five languages and three calendar systems. It identifies a two-stage bottleneck in LLM temporal reasoning: date fragmentation (measured by the new mDFR metric) in low-resource languages, and internal temporal linearity in high-resource languages.
Entities (4)
Relation Signals (2)
MultiTempBench → evaluates → LLM
confidence 95% · We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio
mDFR → measures → Tokenisation
confidence 90% · we propose mDFR, a multilingual extension of Date Fragmentation Ratio... to measure the tokenisation quality of dates
Cypher Suggestions (2)
Find all benchmarks and the models evaluated on them. · confidence 90% · unvalidated
MATCH (b:Benchmark)-[:EVALUATES]->(m:Model) RETURN b.name, m.name
Identify metrics used to evaluate model performance. · confidence 85% · unvalidated
MATCH (m:Metric)-[:MEASURES]->(p:Process) RETURN m.name, p.name
Abstract
Abstract: We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
Tags
Links
- Source: https://arxiv.org/abs/2603.19017v1
- Canonical: https://arxiv.org/abs/2603.19017v1
Full Text
67,594 characters extracted from source content.
What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Gagan Bhatia¹, Ahmad Muhammad Isa¹, Maxime Peyrard², Wei Zhao¹
¹ University of Aberdeen · ² Université Grenoble Alpes & CNRS · wei.zhao@abdn.ac.uk

1 Introduction

Time is a universal substrate for human reasoning, but temporal expressions are deeply language- and culture-specific. Real-world systems—calendar assistants, travel planners, clinical and legal timeline reconstruction, historical question answering, and forecasting—must interpret and manipulate dates, times, and temporal relations expressed in heterogeneous surface forms (e.g., 2024-05-01 vs. "1 May 2024") and under distinct calendrical conventions (e.g., Gregorian vs. Hijri vs. Lunar).
These requirements are inherently multilingual: users routinely mix scripts, localized month lexemes, and calendar markers, and many high-stakes workflows depend on correct temporal normalisation and arithmetic across languages and regions.

¹ https://github.com/gagan3012/mtb

Figure 1: Mechanistic understanding of multilingual temporal reasoning in MULTITEMPBENCH.

arXiv:2603.19017v1 [cs.CL] 19 Mar 2026

Recent temporal benchmarks have advanced our understanding of LLMs' abilities in date arithmetic, temporal ordering, and time-sensitive QA, but they overwhelmingly focus on English and Gregorian representations (Wang and Zhao, 2024; Fatemi et al., 2024; Zhu et al., 2024; Chu et al., 2023; Islakoglu and Kalo, 2025; Wei et al., 2025; Liu et al., 2025; Sasaki et al., 2025; Pęzik et al., 2025). In parallel, emerging work on cross-calendar reasoning highlights that LLMs remain inadequate for inter-calendar conversion and that non-Gregorian temporal structure is still underexplored despite its global relevance (Han et al., 2025; Saxena et al., 2025; Miao et al., 2025a; Wang and Dong, 2026; Holtermann et al., 2025).

An orthogonal challenge concerns how dates are presented to the model. Temporal strings are structured symbolic objects, yet subword tokenisers (e.g., BPE and byte-level tokenisation) can fragment them into opaque substrings, potentially erasing semantic boundaries such as year/month/day separators and calendar markers (Spathis and Kawsar, 2023; Bhatia et al., 2025a). More broadly, multilingual tokenisation is known to induce systematic inequities: low-resource languages often incur heavier fragmentation (a "token tax"), which increases effective sequence length and can degrade downstream performance (Ahia et al., 2023; Petrov et al., 2023; Lundin et al., 2025a; Kanjirangat et al., 2025).
For numeracy specifically, the choice between digit-level tokenisation and larger numeric chunks yields distinct arithmetic failure modes: inconsistent segmentation forces the model to infer place value and grouping from unstable boundaries, complicating the learning of multi-digit operations and carry-like mechanisms (Singh and Strouse, 2024; Kreitner et al., 2025). These observations motivate a natural hypothesis: multilingual temporal failures may be primarily a tokenisation problem. However, two gaps prevent a mechanistic account. First, existing tokenisation analyses typically target general text or broad downstream tasks, rather than controlled temporal expressions that combine digits, delimiters, month lexemes, and calendar markers across scripts and calendar systems. Even temporally focused studies largely remain monolingual or calendar-specific (Bhatia et al., 2025a; Han et al., 2025; Miao et al., 2025a). Second, even when behavioural gaps are documented (e.g., accuracy differences and systematic error patterns across languages and formats), we lack clarity on where the failure arises in the processing pipeline: does temporal information degrade at the input layer, fail to form an abstract representation suitable for computation, or break during reasoning and decoding? Prior mechanistic work suggests that LLMs can encode ordered scalar attributes in ways that are approximately linearly decodable from hidden states, and can exhibit stable latent directions corresponding to monotonic temporal progression (Gurnee and Tegmark, 2024; El-Shangiti et al., 2025), but these findings have not been connected to multilingual, multi-calendar temporal reasoning in a controlled setting. As a result, it remains unclear whether multilingual temporal competence requires (i) better surface segmentation, (ii) a shared internal "calendar geometry" that supports computation, or (iii) both.
Our contributions are summarized as follows:

(i) A controlled multilingual, multi-calendar benchmark. We release MULTITEMPBENCH, comprising 15,000 examples across 5 languages, 3 temporal tasks, and multiple date-format complexity levels, including Gregorian, Hijri, and Lunar calendar systems. We evaluate a broad set of LLMs (open-weight and proprietary) in a zero-shot setting, quantifying how language resource level, format complexity, and calendar system affect temporal reasoning.

(ii) A multilingual fragmentation metric. We propose mDFR, a multilingual extension of the Date Fragmentation Ratio (Bhatia et al., 2025a) that penalises semantically destructive segmentations (e.g., digit splitting and boundary loss) with weights learned from human severity ratings.

(iii) A mechanistic account of multilingual temporal performance gaps. We show that bottlenecks shift by language: date fragmentation is most predictive of failure in low-resource regimes (where it disrupts access to compositional components), whereas temporal linearity (probe R²) is the strongest predictor of temporal task performance in high-resource languages, once calendar components (e.g., Year/Month/Day) are frequently accessible in training data. This supports a two-stage view in which tokenisation controls an LLM's surface-level access to calendar components while temporal linearity controls internal temporal representations. To test this directly, we complement descriptive analyses with a crossed mixed-effects regression over all model predictions, allowing us to compare the contribution of date fragmentation and temporal linearity across resource levels.

2 Related Works

Tokenisation bias in multilingual models. Tokenisation remains a critical source of disparity in multilingual LLMs.
Recent studies confirm that low-resource languages and dialects suffer a "token tax": inflated sequence lengths that degrade performance and increase compute costs (Lundin et al., 2025a; Kanjirangat et al., 2025). This disparity is particularly acute for Indian languages, where standard vocabularies often fragment morphological units (Karthika et al., 2025). Such fragmentation may critically impact numeric and temporal reasoning. Bhatia et al. (2025a) identify date fragmentation as a hidden bottleneck, showing that Byte-Pair Encoding (BPE) often splits dates into opaque substrings that hinder temporal arithmetic. Similarly, Singh and Strouse (2024) demonstrate that standard tokenisation degrades arithmetic performance compared to single-token number embeddings. Our work extends this line of inquiry by introducing MULTITEMPBENCH to systematically isolate how the tokenisation quality of dates, measured via our proposed mDFR metric, affects reasoning capabilities in a multilingual setting.

Mechanisms of time: memorisation vs. reasoning. Evaluating temporal understanding requires distinguishing between pattern matching and robust reasoning. Indeed, recent studies reveal that while LLMs maintain stable performance on memorisation-based temporal tasks, their accuracy sharply declines on reasoning-intensive tasks, especially when navigating temporal shifts or integrating new knowledge (Mazzia et al., 2026; Li and Goyal, 2025; Li et al., 2025). Benchmarks such as ChronoSense (Islakoglu and Kalo, 2025), SPAN (Miao et al., 2025b), DateLogicQA (Bhatia et al., 2025b) and TimeBench (Chu et al., 2023) reveal that LLMs struggle with symbolic constraints and temporal commonsense. Mechanistically, Gurnee and Tegmark (2024) find that models possess linear subspaces representing space and time, suggesting relevant information is encoded but not always exploited. However, Mamidanna et al.
(2025) observe that computation is often aggregated only at the final token, creating a fragile information bottleneck. We situate MULTITEMPBENCH at the intersection of these fields. Unlike broad benchmarks, we use controlled date expressions to disentangle tokenisation from reasoning.

3 Our MULTITEMPBENCH

Dataset construction. We introduce MULTITEMPBENCH, a multilingual temporal reasoning benchmark derived from three existing datasets: TRAM (Wang and Zhao, 2024), ToT (Fatemi et al., 2024), and FreshBench (Zhu et al., 2024). TRAM contains 526,668 multiple-choice questions across 10 temporal reasoning tasks covering the period from 1000 to 2024. ToT consists of 46,480 questions focusing on temporal semantics and arithmetic from 52 AD to 2087. FreshBench provides 4,643 forecasting questions from 1900 to 2025. To construct the English foundation of MULTITEMPBENCH, we curated a balanced subset of 750 questions: 250 from TRAM, 250 from ToT, and 250 from FreshBench, covering three temporal reasoning tasks: (i) Date Arithmetic, which evaluates the ability of LLMs to perform addition and subtraction on dates; (ii) Time Zone Conversion, which tests the ability of LLMs to calculate time differences between regions; and (iii) Temporal Relation, which infers the relationship (e.g., before, after, simultaneous) between a specific event and a reference date. Data samples in these tasks are provided in Table 5 (appendix). We selected questions where date components (year, month, and day) are fully specified, then we preprocessed them to remove synthetic entities (e.g., "E15") and internal prompting instructions, ensuring all questions are grammatically correct and natural.

| Lang. (Size) | Type | Pattern | Example |
|---|---|---|---|
| English (300GB) | ISO | YYYY-M-D | 2023-07-03 |
| | Numeric | D/M/Y | 03/07/2023 |
| | Textual | DD Month Y | 03 July 2023 |
| | Phrasal | Day of Month Y | 3rd of July 2023 |
| German (66GB) | ISO | YYYY-M-D | 2023-07-03 |
| | Numeric | D.M.Y | 03.07.2023 |
| | Textual | DD. Month Y | 03. Juli 2023 |
| | Phrasal | DD. Mon... Y | 03. Juli des Jahres 2023 |
| Chinese (47GB) | ISO | YYYY-M-D | 2023-07-03 |
| | Numeric | D/M/Y | 03/07/2023 |
| | Textual | Y年M月D日 | 2023年07月03日 |
| | Lunar | Traditional | 二零二三年六月初九 |
| Arabic (28GB) | ISO | YYYY-M-D | 2023-07-03 |
| | Numeric | D/M/Y | 03/07/2023 |
| | Textual | DD Month Y | (Arabic example garbled in source) |
| | Hijri | Hijri D Mon Y | (Arabic example garbled in source) |
| Hausa (0.3GB) | ISO | YYYY-M-D | 2023-07-03 |
| | Numeric | D/M/Y | 03/07/2023 |
| | Textual | DD ga Month Y | 03 ga Yuli 2023 |
| | Hijri | DD Mon Y AH | 03 Ramadan 1445 AH |

Table 1: Date formats and calendar systems in MULTITEMPBENCH. The "Type" column indicates the format category or specific calendar system (e.g., Lunar, Hijri). All others use the Gregorian calendar.

Multilingual extension. We extended these 750 English questions into four additional languages: German, Chinese, Hausa, and Arabic, using Google Translate (Comanici et al., 2025). We manually verified the machine-generated translations: for each target language, two native speakers validated the translations and edited them (when necessary) to ensure that both the linguistic content and date formats were error-free. The set of languages we selected is based on our linguistic expertise, as well as diversified data availability on the CommonCrawl-100 corpus (Suárez et al., 2019; Penedo et al., 2025), ranging from high-resource languages like English (300 GB) to low-resource ones like Hausa (0.3 GB). As detailed in Table 1, these languages also cover three calendar systems for temporal reasoning: Gregorian, Hijri, and Chinese Lunar.

Data format extension. To assess robustness of temporal reasoning across date formats, we utilized a template-based approach to expand each question into four variants per language with increasing levels of complexity. As shown in Table 1, these formats range from standard ISO Numeric (e.g., Y-M-D) to Localised Numeric formats using local separators, and finally to Calendar-specific phrases (e.g., "03 Ramadan 1445 AH" for Hausa). For calendar-specific variants (e.g., Hijri and Chinese Lunar), we converted to target-language calendars by using existing calendar conversion tools (Alshehri, 2024); the results were verified by native speakers. This expansion results in 3,000 questions per language, totalling 15,000 questions. For a detailed description of the conversion tools, library specifications, and language-specific formatting rules, we refer to Appendix A.1.

4 Our Approach

Our aim is to identify the underlying factors that control temporal reasoning, then evaluate how these factors vary across languages. To do so, we present a metric, which we call the multilingual Date Fragmentation Ratio (mDFR), to measure the tokenisation quality of dates, then introduce temporal geometry to capture the geometric structures of internal temporal representations.

4.1 Multilingual Date Fragmentation Ratio

We extend the Date Fragmentation Ratio (DFR) from Bhatia et al. (2025a) to a multilingual setup, which we call mDFR, denoted as F ∈ [0, 1]:

    F = α₁·𝟙_split + α₂·𝟙_delimiter + α₃·ΔN + α₄·θ    (1)

Here, 𝟙_split and 𝟙_delimiter are binary indicators for split semantic roots (e.g., splitting "2024") and lost separators, respectively. ΔN represents the token-count inflation relative to a semantic baseline. Finally, θ quantifies the structural divergence between the model's token distribution and the ideal semantic units using cosine distance, as defined in Bhatia et al. (2025a). We calibrate the coefficients α by fitting a linear model to human judgements of fragmentation severity across our target languages. We perform human evaluation of mDFR in Appendix A.2.
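Eq. (1) can be sketched in a few lines. This is a minimal, illustrative implementation, not the authors' code: the helper logic for each indicator (and the boundary-overlap proxy used for θ in place of the paper's cosine distance) is my assumption.

```python
# Minimal sketch of the mDFR score in Eq. (1). The indicator helpers and the
# boundary-overlap proxy for θ are illustrative assumptions, not the paper's
# implementation.

def mdfr(model_tokens, semantic_units, alphas=(0.2, 0.2, 0.1, 0.5)):
    """Combine the four fragmentation signals with weights α (Eq. 1)."""
    a1, a2, a3, a4 = alphas

    # 1_split: is any semantic root (e.g. "2024") broken across tokens?
    split = int(not all(u in model_tokens for u in semantic_units))

    # 1_delimiter: is any separator ("-", "/", ".") lost from the output?
    delims = [u for u in semantic_units if u in "-/."]
    delim_lost = int(any(d not in model_tokens for d in delims))

    # ΔN: token-count inflation vs. the semantic baseline, clipped to [0, 1]
    # for this sketch.
    inflation = max(0, len(model_tokens) - len(semantic_units)) / max(len(semantic_units), 1)
    inflation = min(inflation, 1.0)

    # θ: divergence between produced tokens and ideal semantic units
    # (cosine distance in the paper; a set-overlap proxy here).
    theta = 1.0 - len(set(model_tokens) & set(semantic_units)) / max(len(set(semantic_units)), 1)

    return a1 * split + a2 * delim_lost + a3 * inflation + a4 * theta

# An unfragmented segmentation scores 0; a digit-split date scores higher.
clean = mdfr(["2023", "-", "07", "-", "03"], ["2023", "-", "07", "-", "03"])
frag = mdfr(["2", "0", "2", "3", "-", "0", "7", "-", "0", "3"],
            ["2023", "-", "07", "-", "03"])
print(round(clean, 3), round(frag, 3))  # 0.0 0.675
```

With the learned weights reported later in the paper, digit splitting and structural divergence dominate the score, which this sketch reproduces qualitatively.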
4.2 Temporal Geometry

Embedding extraction. For each language ℓ ∈ {EN, DE, ZH, AR, HA} and year y ∈ [1990, 2024], in 3 different date formats (ISO, Slash and Long), we aim to extract a robust representation of the year that is invariant to specific months or days. To do so, we sample K = 5 distinct full dates within year y (e.g., "1995-03-12", "1995-11-05") and embed them into declarative templates (e.g., "The date is <date>", or its Arabic equivalent). We propagate these sequences through the model and extract the hidden state h⁽ℓ⁾_{y,k,i} ∈ ℝᵈ corresponding to the final token at layer i for the k-th date sample of year y. We define the average embedding as:

    h̄⁽ℓ⁾_{y,i} = (1/K) Σ_{k=1}^{K} h⁽ℓ⁾_{y,k,i}    (2)

Geometric notations. We use several geometric concepts to describe the geometry of time in the embedding space.

- Line segment. s⁽ℓ⁾_{y,i} is a line segment connecting the two vectors h̄⁽ℓ⁾_{y+1,i} and h̄⁽ℓ⁾_{y,i}, i.e. the vector difference between years y+1 and y in the embedding space:

    s⁽ℓ⁾_{y,i} = h̄⁽ℓ⁾_{y+1,i} − h̄⁽ℓ⁾_{y,i}    (3)

- A path of line segments. A sequence of line segments is denoted as P⁽ℓ⁾_i = (s⁽ℓ⁾_{y₁,i}, s⁽ℓ⁾_{y₂,i}, ..., s⁽ℓ⁾_{y_T,i}), where each line segment connects to the next, forming a path of years from 1 to T in the embedding space.

- The path direction. We denote the overall path direction as the average of the line segments:

    Δ⁽ℓ⁾_i = (1/(|Y|−1)) Σ_y s⁽ℓ⁾_{y,i}    (4)

If most line segments point in the same direction, then Δ⁽ℓ⁾_i is stable and represents a clear "forward-in-time" direction for language ℓ at the i-th layer.

Linear structure of time. We test whether calendar values (e.g., a sequence of years 2000, ..., 2010) form an underlying linear structure in a 1D subspace of the embedding space. For the Year calendar component c ∈ Y, we train a linear regressor that decodes the corresponding scalar value from the hidden representation h̄⁽ℓ⁾_{y,i}.
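The quantities in Eqs. (2)–(4) can be sketched numerically. The random vectors below are placeholders for real hidden states; the shapes and variable names are my assumptions.

```python
import numpy as np

# Sketch of the temporal-geometry quantities in Eqs. (2)-(4). Random vectors
# stand in for real hidden states; K and the year range follow the text.

rng = np.random.default_rng(0)
years = np.arange(1990, 2025)   # y in [1990, 2024]
K, d = 5, 16                    # K date samples per year, hidden size d

# h[y, k] ~ hidden state of the final token for the k-th date of year y.
h = rng.normal(size=(len(years), K, d))

# Eq. (2): per-year average embedding, mean over the K date samples.
h_bar = h.mean(axis=1)                  # shape (|Y|, d)

# Eq. (3): line segments between consecutive years, s_y = h̄_{y+1} - h̄_y.
segments = h_bar[1:] - h_bar[:-1]       # shape (|Y|-1, d)

# Eq. (4): overall path direction = average of the segments.
direction = segments.mean(axis=0)       # shape (d,)

# If most segments point the same way, their cosine with the path direction
# is high, indicating a stable "forward-in-time" axis at this layer.
cos = segments @ direction / (
    np.linalg.norm(segments, axis=1) * np.linalg.norm(direction) + 1e-12)
print(round(float(cos.mean()), 3))
```

Note that by telescoping, the average of the segments equals (h̄_last − h̄_first)/(|Y|−1), so the path direction depends only on the endpoints of the year path.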
Concretely, we fit

    ĉ = W_c · h̄⁽ℓ⁾_{y,i} + b_c,    Linearity(c) = R²(c, ĉ)    (5)

where R² measures how well the Year values can be recovered by a single linear readout. A higher R² indicates that the Year component is organised along an approximately ordered axis in the embedding space, which may help LLMs perform date arithmetic more effectively. We also apply this idea to the Month and Day components.

5 Experiments

We examine a diverse suite of decoder-only LLMs to disentangle the effects of model size, architecture, and tokeniser composition. Our open LLMs include the Qwen3 family (spanning from 0.6B to 14B parameters; Yang et al., 2025), LLaMA 3 (8B, 70B; Touvron et al., 2023), and variants of OLMo (OLMo et al., 2025; Groeneveld et al., 2024), Gemma (Team et al., 2025), Mistral (Mistral-AI et al., 2025), and Phi-4 (Microsoft et al., 2025). To benchmark against proprietary systems, we evaluate GPT-4o and GPT-4o-mini (OpenAI et al., 2024). This selection allows us to isolate tokeniser-induced errors (Singh and Strouse, 2024; Lundin et al., 2025b; Bhatia et al., 2025a,b) from obscure failures in temporal reasoning.

5.1 Tokenisation Setup

Baseline vs. model tokenisers. We contrast each model's native subword segmentation against a deterministic, linguistically informed baseline tokeniser. This baseline segments date strings into semantic primitives (year, month, day, calendar marker), strictly preserving delimiters and whitespace. It is designed to be language-aware, correctly parsing Arabic-Indic numerals, Chinese temporal markers (e.g., 年, 月), and Hijri suffixes. For each instance in MULTITEMPBENCH, we compute the divergence between the model's native segmentation (using TikToken or Hugging Face tokenizers) and this semantic baseline.

Multilingual date fragmentation ratio (mDFR). We evaluate tokenisation quality using our mDFR metric.
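The linear readout of Eq. (5) can be sketched with scikit-learn. The synthetic embeddings below (a planted year direction plus noise) are an assumption for the demo, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of the linearity probe in Eq. (5): fit c_hat = W_c * h_bar + b_c and
# report R^2 for recovering the Year value. Synthetic embeddings stand in for
# real hidden states: years are planted along one random axis plus noise.

rng = np.random.default_rng(0)
years = np.arange(1990, 2025).astype(float)
d = 32
axis = rng.normal(size=d)
H = np.outer(years - years.mean(), axis) + 0.1 * rng.normal(size=(len(years), d))

probe = LinearRegression().fit(H, years)
linearity = probe.score(H, years)   # R^2 of the single linear readout
print(round(linearity, 3))          # near 1.0 for this planted axis
```

Because the year value is embedded along a single direction here, the probe recovers it almost perfectly; for real hidden states, R² quantifies how close the model comes to this idealised geometry.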
The learned coefficients for the metric are α = (0.2, 0.2, 0.1, 0.5) (Table 7), reflecting that structural divergence (θ) and root splitting are more detrimental than simple token-count inflation. Table 2 provides a qualitative comparison of tokenisation behaviours using the Gemma 3 tokeniser. High-resource languages like German and English exhibit moderate fragmentation (mDFR ≈ 0.50–0.53), typically characterised by the splitting of numeric roots (e.g., "2034" becoming 2|0|3|4) while largely preserving semantic delimiters and month names. In contrast, low-resource settings suffer from semantic fragmentation; for instance, the Hausa date "Oktoba 10, 2034" yields the highest DFR of 0.78, as the month name is broken into opaque sub-word units (O|kt|oba) alongside the numeric splitting.

5.2 Temporal Reasoning Evaluation Setup

Prompting strategy. We evaluate models in a zero-shot setting without fine-tuning, chain-of-thought demonstrations, or external knowledge, as these may help LLMs resolve temporal tasks even if date strings are poorly tokenised. Each prompt consists of the question and a concise instruction to output the final answer.

LLM-as-a-judge. Given the diverse output formats across languages, we employ an LLM-based evaluation pipeline. For every prediction, we generate a JSON record containing the question, the model's raw output, and a set of gold-standard aliases (e.g., "03/04/2025", "3 April 2025", and the Arabic equivalent). GPT-4o acts as the judge, classifying the response as CORRECT (consistent with gold aliases), INCORRECT (mutually exclusive), or NOT_ATTEMPTED, a scheme initially introduced by OpenAI for the QA task (Wei et al., 2024). The automated judge achieved an 87% agreement rate with the majority human vote (inter-annotator agreement: Cohen's κ = 0.89) on a validation set of 250 multilingual instances.
(For more details, please see Appendix A.4.)

6 Results

Our goal is to address the core question: what controls temporal reasoning performance, surface tokenisation of dates, or internal geometric structures of temporal representations? To do so, we first test whether date fragmentation predicts accuracy across models and settings (Section 6.2). We then test whether calendar geometry in hidden states predicts accuracy (Section 6.3). Finally, we synthesise which factor is necessary and/or sufficient for strong temporal reasoning (Section 6.4).

6.1 Multilingual Temporal Reasoning Performance

| Format | Language | Calendar | Original String | Baseline Tokenization | Gemma 3 Tokenization | mDFR |
|---|---|---|---|---|---|---|
| D. Month YYYY | German | Greg. | 10. Oktober 2034 | 10 . Oktober 2034 | 1·0·.·Oktober·2·0·3·4 | 0.50 |
| Month D, YYYY | English | Greg. | October 10, 2034 | October 10 , 2034 | October·1·0·,·2·0·3·4 | 0.53 |
| Y年M月D日 | Chinese | Greg. | 2034年10月10日 | 2034年10月10日 | 2·0·3·4·年·1·0·月·1·0·日 | 0.55 |
| D Month Y (Hijri marker) | Arabic | Hijri | (garbled in source) | (garbled in source) | (garbled in source) | 0.60 |
| D Month Y AH | English | Hijri | 27 Rajab 1456 AH | 27 Rajab 1456 AH | 2·7·Raj·ab·1·4·5·6·AH | 0.60 |
| 干支年M月DD | Chinese | Lunar | 辛亥年五月廿三 | 辛亥年五月廿三 | 辛·亥·年·五·月·廿·三 | 0.65 |
| D Month YYYY | Arabic | Greg. | (garbled in source) | (garbled in source) | (garbled in source) | 0.70 |
| D Month YYYY | English | Greg. | 10 October 2034 | 10 October 2034 | 1·0·October·2·0·3·4 | 0.75 |
| Month D, YYYY | Hausa | Greg. | Oktoba 10, 2034 | Oktoba 10 , 2034 | O·kt·oba·1·0·,·2·0·3·4 | 0.78 |

Table 2: Qualitative Analysis of Tokenisation Fragmentation. Interpuncts (·) denote token boundaries within the Gemma 3 tokenizer output; Arabic-script entries were garbled during text extraction. Note the severe fragmentation in non-Latin scripts (Arabic, Chinese) and the splitting of month names in Hausa.

Table 3 reports temporal reasoning accuracy averaged across the three tasks within each language. Two patterns stand out. First, performance is highly language-dependent: most models are relatively
| Model | Arabic | Chinese | English | German | Hausa | Average |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| GPT-4o | 71.3 | 66.0 | 54.3 | 70.0 | 51.7 | 62.7 |
| **Open-Weights Models** | | | | | | |
| Gemma 3 4B | 57.3 | 64.7 | 63.7 | 64.0 | 46.3 | 59.2 |
| Llama 3.1 8B | 49.0 | 65.0 | 66.7 | 64.3 | 41.3 | 57.3 |
| Phi-4 Mini | 39.7 | 55.7 | 66.3 | 62.7 | 28.7 | 50.6 |
| Qwen 3 4B | 41.3 | 56.7 | 54.7 | 46.0 | 9.0 | 41.5 |
| Mistral 7B v0.2 | 44.3 | 40.7 | 51.7 | 54.7 | 9.0 | 40.1 |
| Llama 2 7B | 15.7 | 40.7 | 55.0 | 51.3 | 17.0 | 35.9 |
| Gemma 3 1B | 27.3 | 42.7 | 40.7 | 38.0 | 19.3 | 33.6 |
| Olmo 3 7B | 16.3 | 39.3 | 48.0 | 33.3 | 12.3 | 29.8 |
| DS-R1 Qwen 7B | 24.7 | 48.0 | 45.3 | 42.3 | 1.7 | 32.4 |
| OLMo 2 7B | 16.3 | 39.3 | 48.0 | 33.3 | 12.3 | 29.9 |
| GPT-OSS 20B | 5.0 | 24.0 | 49.0 | 20.3 | 2.0 | 20.0 |
| Qwen 3 14B | 25.3 | 9.7 | 19.0 | 27.7 | 2.3 | 16.8 |
| Qwen3 0.6B | 21.7 | 19.7 | 14.7 | 23.3 | 4.0 | 16.6 |

Table 3: Multilingual Temporal Reasoning Accuracy (%). Accuracy is averaged across the three tasks (date arithmetic, time zone conversion, temporal relation extraction) within each language.

strong in high-resource languages (English, Chinese, German) but degrade sharply in Hausa, indicating a distinct low-resource regime where temporal reasoning is brittle. Second, model ranking is not explained by model size alone: some smaller open-source LLMs outperform larger ones (for example, the 4B-parameter Gemma 3 achieves a 59.2% average, surpassing both the 8B-parameter Llama 3.1 at 57.3% and the 20B-parameter GPT-OSS at 20.0%), suggesting that multilingual coverage and training/tokenisation choices outrank raw parameter count for this benchmark. These accuracy trends motivate the mechanistic split we test next: low-resource failures align with an input accessibility bottleneck (date fragmentation), whereas high-resource variation is better explained by an internal geometry bottleneck (temporal linearity).
| Model | Greg. Ar | Greg. Zh | Greg. En | Greg. De | Greg. Ha | Lunar Zh | Hijri Ar | Hijri En | Hijri Ha |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| GPT-3.5 | 0.19 | 0.12 | 0.23 | 0.12 | 0.12 | 0.41 | 0.14 | 0.31 | 0.30 |
| GPT-4 | 0.19 | 0.12 | 0.23 | 0.12 | 0.12 | 0.41 | 0.14 | 0.31 | 0.30 |
| GPT-5 | 0.19 | 0.12 | 0.23 | 0.12 | 0.12 | 0.41 | 0.14 | 0.31 | 0.30 |
| Llama 2 | 0.06 | 0.23 | 0.42 | 0.30 | 0.29 | 0.20 | 0.04 | 0.23 | 0.23 |
| Phi 3.5 | 0.06 | 0.23 | 0.42 | 0.30 | 0.29 | 0.20 | 0.04 | 0.23 | 0.23 |
| Mistral | 0.06 | 0.23 | 0.42 | 0.30 | 0.29 | 0.29 | 0.04 | 0.23 | 0.23 |
| Davinci-003 | 0.17 | 0.17 | 0.37 | 0.10 | 0.16 | 0.29 | 0.09 | 0.37 | 0.39 |
| OLMo | 0.13 | 0.18 | 0.37 | 0.09 | 0.16 | 0.41 | 0.09 | 0.37 | 0.40 |
| Llama 3 | 0.35 | 0.16 | 0.34 | 0.12 | 0.12 | 0.60 | 0.36 | 0.31 | 0.30 |
| DeepSeek | 0.10 | 0.31 | 0.44 | 0.34 | 0.32 | 0.61 | 0.05 | 0.30 | 0.29 |
| gpt-oss | 0.39 | 0.16 | 0.34 | 0.12 | 0.13 | 0.60 | 0.44 | 0.32 | 0.32 |
| Qwen3 | 0.17 | 0.32 | 0.44 | 0.34 | 0.32 | 0.61 | 0.16 | 0.31 | 0.28 |
| Cohere | 0.18 | 0.34 | 0.44 | 0.34 | 0.33 | 0.63 | 0.22 | 0.32 | 0.32 |
| Gemma3 | 0.39 | 0.34 | 0.44 | 0.34 | 0.33 | 0.48 | 0.30 | 0.31 | 0.31 |

Table 4: Multilingual Date Fragmentation Ratio (mDFR) across models for Gregorian (Ar, Zh, En, De, Ha), Lunar (Zh), and Hijri (Ar, En, Ha). Higher scores indicate greater fragmentation of date tokens.

6.2 Date Fragments and Temporal Reasoning

Fragmentation varies by language and calendar format. Table 4 reports mDFR across models and calendar varieties. Fragmentation arises from digit splitting (e.g., 2034 → 2|0|3|4; Table 2). These effects are amplified in low-resource settings and in less frequent calendar variants: for instance, non-Gregorian formats are often incorrectly tokenised, with more date fragments than Gregorian formats (Table 2), due to sparse calendar markers and lower-frequency month lexemes.

Date fragments are a major bottleneck for temporal reasoning in low-resource languages. Figure 2 shows that the greater the mDFR, the lower the accuracy in the date arithmetic task for two low-resource languages: Hausa (r = −0.97) and Arabic (r = −0.89). However, the correlation becomes much weaker for the three high-resource languages: German, Chinese, and English, indicating that more date fragments do not cause an accuracy collapse for these languages.
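The digit-splitting failure mode described above can be illustrated with a toy version of the baseline-versus-model comparison from Section 5.1. Both tokenisers below are stand-ins I made up for the demo; the real experiments use each model's own tokeniser (TikToken / Hugging Face).

```python
import re

# Toy illustration of the baseline-vs-model tokeniser comparison. The regex
# baseline and the digit-splitting "model" tokeniser are illustrative
# assumptions, not the authors' implementation.

def baseline_tokenise(date_str):
    # Semantic primitives: runs of digits, runs of letters (month names,
    # markers like "AH"), and single delimiter characters.
    return re.findall(r"\d+|[^\W\d_]+|[-/.,]", date_str)

def digit_splitting_tokenise(date_str):
    # Every digit becomes its own token, mimicking the 2034 -> 2|0|3|4
    # fragmentation pattern reported for BPE tokenisers.
    return [t for tok in baseline_tokenise(date_str)
            for t in (list(tok) if tok.isdigit() else [tok])]

base = baseline_tokenise("03 Ramadan 1445 AH")
frag = digit_splitting_tokenise("03 Ramadan 1445 AH")
print(base)                   # ['03', 'Ramadan', '1445', 'AH']
print(frag)                   # ['0', '3', 'Ramadan', '1', '4', '4', '5', 'AH']
print(len(frag) - len(base))  # token-count inflation ΔN = 4
```

The gap between the two segmentations is exactly what mDFR penalises: lost roots, extra tokens, and divergence from the semantic units.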
Figure 2: Impact of Tokenisation on Date Arithmetic Accuracy. Per-language scatter plots of date arithmetic accuracy (%) against fragmentation ratio: Arabic (r = −0.89), English (r = −0.17), German (r = −0.46), Hausa (r = −0.97), Chinese (r = −0.15). DFR is strongly negatively correlated with accuracy in Hausa, but only weakly correlated in English.

Overall, these results suggest that tokenisation is a major bottleneck for temporal reasoning in low-resource languages, but not in high-resource ones. Bhatia et al. (2025a) explained why this is the case in English: they found that LLMs can compensate for date fragmentation by stitching fragmented date tokens during temporal reasoning. We speculate that in high-resource languages, such fragments are still frequently observed in training data, which enables LLMs, especially larger models, to internally address/stitch them. We observe similar findings in the other two tasks: time zone conversion and temporal relation (see Figure 6 and Figure 7).

6.3 Geometric Structures and Temporal Reasoning

Tokenisation characterises what information is presented to the model, but not whether the model organises that information in a form suitable for computation. We therefore test whether temporal representations possess internal geometric structures that support calendar manipulation. Here we focus on temporal linearity: how well temporal values (e.g., years) are organised along approximately ordered 1D axes, measured by the R² of linear probes decoding Day, Month, and Year from hidden states.
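The language-wise Pearson correlations reported in these analyses can be reproduced in a few lines. The five (mDFR, accuracy) points below are made-up stand-ins, not values from the paper.

```python
import numpy as np

# Sketch of the per-language correlation analysis: Pearson r between each
# model's fragmentation score and its task accuracy. The data points are
# illustrative placeholders.

mdfr_scores = np.array([0.12, 0.16, 0.23, 0.29, 0.32])  # one entry per model
accuracy = np.array([62.0, 55.0, 41.0, 28.0, 20.0])     # date arithmetic (%)

r = np.corrcoef(mdfr_scores, accuracy)[0, 1]
print(round(r, 2))  # strongly negative, like the Hausa/Arabic panels
```

The same computation with probe R² in place of mDFR yields the linearity-accuracy correlations of Section 6.3.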
Temporal linearity is a strong predictor of performance in high-resource languages. Figure 3 plots the average temporal reasoning accuracy (across date arithmetic, time zone conversion, and temporal relation extraction) against overall temporal linearity (aggregated across all calendar components, including delimiters) across models within each language. Overall, temporal linearity is strongly correlated with accuracy in English (r = 0.77) and Chinese (r = 0.75), moderately correlated in German (r = 0.44) and Arabic (r = 0.34), and weakly correlated in Hausa (r = 0.10). This pattern suggests that, once models can reliably access calendar components, strong performance depends on representing temporal values in an ordered geometry that supports arithmetic-like updates. In other words, in high-resource settings the main limiter is not surface form, but whether the model embeds time on a usable internal axis.

Component-wise view: the Year axis is typically the most predictive. While Figure 3 summarises the holistic relationship between temporal linearity and accuracy (noting that this overall correlation is distinct from a simple average of the individual components), Figure 4 decomposes this geometry by calendar component (Day/Month/Year) within each language. Two trends stand out. First, correlations are generally strongest for Year (especially in English/Chinese), consistent with year values providing the primary ordered backbone required for many temporal operations. Second, Month and Day linearity show weaker and more heterogeneous correlations across languages. This suggests that Month and Day representations are not as robustly formed as Year representations, and are instead more sensitive to language- and format-specific cues (e.g., month lexemes and delimiters) than to a universal ordered axis.
Overall, the component-wise breakdown supports the interpretation that ordered temporal geometry matters most when it provides a stable year backbone, and that this signal is clearest in high-resource languages.

6.4 Which Mechanism Controls Temporal Reasoning?

To test the relative contributions of date fragmentation and temporal linearity, we fit a crossed mixed-effects regression predicting per-question accuracy over all 285,000 predictions (15,000 questions × 19 models). The dependent variable was binary correctness for each prediction. As fixed effects, we included z-scored mDFR, z-scored linearity, resource level (high-resource vs. low-resource), and all interaction terms: correct ∼ mDFR_z ∗ linearity_z ∗ resource. We also included crossed random intercepts for question and model to account for item difficulty and model-specific baseline performance. This analysis lets us test whether temporal reasoning performance is better explained by surface tokenisation of dates or by internal geometric structures of temporal representations, and whether this differs by resource level.

Figure 3: Temporal linearity vs. accuracy across languages. Temporal linearity (probe R²) is strongly correlated with accuracy in English (r = 0.77) and Chinese (r = 0.75), but weakly correlated in Hausa (r = 0.10), suggesting that ordered temporal geometry is a key driver of high performance when it emerges.

Figure 4: Component-wise temporal linearity vs. accuracy. Correlations between accuracy and probe R² for Day, Month, and Year within each language. [Panels report per-language correlations for Day (Arabic 0.22, Chinese 0.54, English 0.65, German 0.60, Hausa 0.23), Month (0.49, 0.62, 0.77, 0.74, 0.40), and Year (0.34, 0.49, 0.64, 0.64, 0.08).]

Figure 5: Mixed-effects summary of temporal reasoning bottlenecks. (a) Fixed effects (β with 95% CI) from the crossed mixed-effects regression. (b) Dominant predictor by resource regime: mDFR in low-resource languages, linearity in high-resource languages. [|Pearson r| values: low-resource (AR, HA) mDFR 0.97 vs. linearity 0.37; high-resource (EN, DE, ZH) mDFR 0.43 vs. linearity 0.97.]

The regression confirms that temporal reasoning performance is governed by a language-dependent bottleneck. We report regression coefficients using the notation β (coefficient), SE (standard error), z (Wald statistic), and p (p-value). Most importantly, the three-way interaction between mDFR, linearity, and resource level is significant (β = 0.016, SE = 0.007, z = 2.31, p = 0.021), showing that the dominant predictor changes across language regimes. In low-resource languages (Arabic and Hausa), higher fragmentation strongly predicts lower accuracy (β = −0.126, p < 0.001), indicating that date fragmentation is the dominant bottleneck. In high-resource languages (English, German, and Chinese), temporal linearity is instead the stronger predictor of accuracy (β = 0.087, p < 0.001), while mDFR has only a weak effect (β = 0.009, p = 0.056). These results align with the language-wise analyses above: low-resource languages tend to be input-limited, whereas high-resource languages tend to be geometry-limited. Figure 5a summarises these results across MULTITEMPBENCH, and also shows that the main effects of resource level and its interactions with mDFR and linearity are the largest fixed effects in the model, consistent with a resource-dependent shift in the dominant bottleneck. Figure 5b makes the resulting split explicit for the two language regimes.
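Fitting the crossed random intercepts requires a mixed-model package (e.g., lme4 or statsmodels MixedLM), but the fixed-effect part of the design, z-scored predictors with all two- and three-way interactions, can be sketched directly. The row values below are hypothetical, and `zscore` is our helper name.

```python
# Sketch: fixed-effect design for correct ~ mDFR_z * linearity_z * resource,
# i.e. three main effects plus all two- and three-way interactions. The
# crossed random intercepts for question and model are not fitted here.
from math import sqrt

def zscore(xs):
    m = sum(xs) / len(xs)
    s = sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / s for x in xs]

# (mDFR, linearity R^2, high-resource indicator) per prediction (hypothetical).
rows = [(0.55, 0.40, 0), (0.20, 0.62, 1), (0.48, 0.35, 0), (0.25, 0.70, 1)]
mdfr = zscore([r[0] for r in rows])
lin = zscore([r[1] for r in rows])
res = [float(r[2]) for r in rows]

design = [
    [1.0, f, l, hr, f * l, f * hr, l * hr, f * l * hr]
    for f, l, hr in zip(mdfr, lin, res)
]
print(len(design), len(design[0]))  # 4 8  (4 rows; intercept + 7 terms)
```

The significant three-way term in the paper corresponds to the last column: it lets the mDFR-by-linearity trade-off differ between resource regimes.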
Overall, no single factor universally controls temporal reasoning across languages; instead, the dominant constraint shifts from date fragmentation to temporal linearity as resource level increases. Practically, this distinction matters because resource gaps across languages are expensive and slow to close, whereas linearity gaps may be more tractable through targeted interventions, such as re-aligning temporal representations.

7 Conclusion

MULTITEMPBENCH shows that multilingual temporal intelligence depends on more than adding vocabulary: it requires making temporal information accessible and computable in the model's internal space. A crossed mixed-effects regression confirms this language-dependent bottleneck: in low-resource regimes, date fragmentation is the stronger predictor of failure, while in high-resource regimes temporal linearity is the stronger predictor of performance.

Limitations

MULTITEMPBENCH is designed as a controlled diagnostic, and that design imposes constraints on generality. It covers five languages (English, German, Chinese, Arabic, and Hausa) and three task families (date arithmetic, time zone conversion, and temporal relation extraction), so it does not fully represent the diversity of multilingual temporal phenomena (e.g., additional scripts and dialects, code-mixing/noisy text, domain-specific jargon, or calendar conventions beyond those included). Instances are produced via translation and templated format variation from a curated English seed set, which helps isolate surface-form/tokenisation effects but may under-sample naturally occurring distributions of expressions and errors. We evaluate in a zero-shot, direct-answer setting (and normalise outputs with an LLM-as-a-judge), which improves comparability yet may understate performance under tool use, prompting, or fine-tuning, and introduces residual evaluation noise from judge mistakes and format ambiguity. Finally, while we find strong associations between
fragmentation metrics, temporal linearity, and performance, our mechanistic analyses remain correlational and probe-centric, leaving open causal questions about tokeniser design, training data, decoding dynamics, and non-linear representational structure. A key limitation of our multilingual setup is that the low-resource regime is represented by only two languages (Arabic and Hausa), so claims about low-resource temporal reasoning should be interpreted as suggestive rather than fully general across low-resource languages. More broadly, the split into high-resource versus low-resource languages is necessarily coarse, and additional languages are needed to test whether the same bottleneck pattern holds across other typological profiles, scripts, and calendar traditions.

Ethical Considerations

This benchmark surfaces disparities that can arise from multilingual tokenisation and resource imbalance; such results should be framed as properties of model design and data coverage rather than as inherent deficits of particular languages, to avoid reinforcing harmful narratives. Because calendar expressions are culturally situated (including non-Gregorian systems such as Hijri and Chinese Lunar), conversion or formatting errors can have real consequences in downstream, potentially high-stakes contexts, so users should document conversion assumptions and validate systems with native-speaker and domain-expert review when decisions matter. The dataset is constructed from public sources through translation and controlled transformations and is not intended to contain personal data; extensions should avoid introducing identifiable or sensitive information, and the human annotation used here follows informed consent and fair compensation practices. Finally, any dependence on third-party model APIs for translation and/or evaluation can affect reproducibility and raise governance concerns, so releases will document versions and settings and,
where feasible, provide open alternatives, while acknowledging that improved temporal reasoning can be dual-use and warrants domain-specific risk assessment and human oversight in sensitive deployments.

Broader Impact

By providing a controlled multilingual temporal benchmark and analysis signals (e.g., fragmentation and representation-geometry probes), this work can help the community audit and improve temporal reasoning across scripts, languages, and calendar conventions, potentially reducing "token tax" effects and improving language equity in multilingual NLP. It may also guide more principled tokeniser and training-data interventions by linking surface segmentation properties to downstream competence. However, like any benchmark, it can distort incentives if treated as a leaderboard target, encouraging optimisation for templated formats or discouraging support for languages that score poorly. We therefore emphasise its role as a diagnostic instrument rather than a deployment-readiness test, and encourage follow-on work to broaden coverage (more languages/dialects and naturalistic temporal text), evaluate mitigation strategies directly, and report results with uncertainty and careful error analysis to support responsible, inclusive progress.

References

Orevaoghene Ahia, David de Almeida, Nathan Shleifer, and Emily Dinan. 2023. Do all languages cost the same? Tokenization in the era of commercial language models. arXiv preprint arXiv:2305.13707.

Mohammed H. Alshehri. 2024. Hijridate: A Python package for Hijri-Gregorian date conversion.

Gagan Bhatia, Maxime Peyrard, and Wei Zhao. 2025a. Date fragments: A hidden bottleneck of tokenization for temporal reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3201–3219, Suzhou, China. Association for Computational Linguistics.

Gagan Bhatia, Ming Ze Tang, Cristina Mahanta, Madiha Kazi, Maxime Peyrard, and Wei Zhao. 2025b.
DateLogicQA: Benchmarking temporal biases in large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 321–332, Albuquerque, USA. Association for Computational Linguistics.

Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, and Bing Qin. 2023. TimeBench: A comprehensive evaluation of temporal reasoning abilities in large language models. arXiv preprint arXiv:2311.17667.

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, and 3416 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. Preprint, arXiv:2507.06261.

Ahmed Oumar El-Shangiti, Tatsuya Hiraoka, Hilal AlQuabeh, Benjamin Heinzerling, and Kentaro Inui. 2025. The geometry of numerical reasoning: Language models compare numeric properties in linear subspaces. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 550–561, Albuquerque, New Mexico. Association for Computational Linguistics.

Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, and Bryan Perozzi. 2024. Test of time: A benchmark for evaluating LLMs on temporal reasoning. arXiv preprint arXiv:2406.09170.
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, and 24 others. 2024. OLMo: Accelerating the science of language models. Preprint, arXiv:2402.00838.

Wes Gurnee and Max Tegmark. 2024. Language models represent space and time. In Proceedings of the 12th International Conference on Learning Representations. ICLR 2024.

Yicheng Han, Shih-Ming Wang, Jialu Zhang, Qian Liu, and Wei Lu. 2025. TickTack: Modeling temporal relationships in LLMs using non-Gregorian calendars. arXiv preprint arXiv:2503.04150.

Carolin Holtermann, Paul Röttger, and Anne Lauscher. 2025. Around the world in 24 hours: Probing LLM knowledge of time and place. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22875–22897, Vienna, Austria. Association for Computational Linguistics.

Duygu Sezen Islakoglu and Jan-Christoph Kalo. 2025. ChronoSense: Exploring temporal understanding in large language models with time intervals of events. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Short Papers). ACL 2025.

Vani Kanjirangat, Tanja Samardžić, Ljiljana Dolamic, and Fabio Rinaldi. 2025. Tokenization and representation biases in multilingual models on dialectal NLP tasks. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. EMNLP 2025.

N. J. Karthika, Maharaj Brahma, Rohit Saluja, Ganesh Ramakrishnan, and Maunendra Sankar Desarkar. 2025. Multilingual tokenization through the lens of Indian languages: Challenges and insights. arXiv preprint arXiv:2506.17789.

Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, and Martin J. Menten.
2025. Efficient numeracy in language models through single-token number embeddings. arXiv preprint arXiv:2510.06824.

Aochong Oliver Li and Tanya Goyal. 2025. Memorization vs. reasoning: Updating LLMs with new knowledge. In Findings of the Association for Computational Linguistics: ACL 2025. ACL Findings 2025.

Huihan Li, You Chen, Siyuan Wang, Yixin He, Ninareh Mehrabi, Rahul Gupta, and Xiang Ren. 2025. Diagnosing memorization in chain-of-thought reasoning, one token at a time. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. EMNLP 2025.

Zefang Liu, Nam H. Nguyen, Yinzhu Quan, and Shi-Xiong Zhang. 2025. Temporal tokenization strategies for event sequence modeling with large language models. Preprint, arXiv:2512.13618.

Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, and Cody Carroll. 2025a. The token tax: Systematic bias in multilingual tokenization. arXiv preprint arXiv:2509.05486.

Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, and Cody Carroll. 2025b. The token tax: Systematic bias in multilingual tokenization. Preprint, arXiv:2509.05486.

Siddarth Mamidanna, Daking Rai, Ziyu Yao, and Yilun Zhou. 2025. All for one: LLMs solve mental math at the last token with information transferred from other tokens. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. EMNLP 2025.

Vittorio Mazzia, Sandro Pollastrini, Davide Bernardi, Chiara Rubagotti, and Daniele Amberti. 2026. Benchmarking multilingual temporal reasoning in LLMs: The temporal reasoning dataset. In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 168–181, Trento, Italy. Association for Computational Linguistics.

Zeyu Miao and 1 others. 2025a. Benchmarking and improving cross-calendar temporal reasoning of large language models. Preprint, arXiv:2511.09993.

Zhongjian Miao, Hao Fu, and Chen Wei. 2025b.
Span: Benchmarking and improving cross-calendar temporal reasoning of large language models. Preprint, arXiv:2511.09993.

Microsoft: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, and 57 others. 2025. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs. Preprint, arXiv:2503.01743.

Mistral-AI: Abhinav Rastogi, Albert Q. Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, Jason Rute, Joep Barmentlo, Karmesh Yadav, Kartik Khandelwal, Khyathi Raghavi Chandu, Léonard Blier, Lucile Saulnier, Matthieu Dinot, Maxime Darrin, Neha Gupta, Roman Soletskyi, Sagar Vaze, and 82 others. 2025. Magistral. Preprint, arXiv:2506.10910.

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, and 24 others. 2025. 2 OLMo 2 Furious. Preprint, arXiv:2501.00656.

OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. GPT-4o system card. Preprint, arXiv:2410.21276.

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. 2025. FineWeb2: One pipeline to scale them all - adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920.

Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. 2023.
Language model tokenizers introduce unfairness between languages. arXiv preprint arXiv:2305.15425.

Piotr Pęzik, Konrad Kaczyński, Maria Szymańska, Filip Żarnecki, Zuzanna Deckert, Jakub Kwiatkowski, and Wojciech Janowski. 2025. LLMLagBench: Identifying temporal training boundaries in large language models. Preprint, arXiv:2511.12116.

Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, and Benjamin Heinzerling. 2025. Can language models handle a non-Gregorian calendar? The case of the Japanese wareki. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 444–463, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.

Rohit Saxena, Aryo Pradipta Gema, and Pasquale Minervini. 2025. Lost in time: Clock and calendar understanding challenges in multimodal LLMs. arXiv preprint arXiv:2502.05092.

Aaditya K. Singh and D. J. Strouse. 2024. Tokenization counts: The impact of tokenization on arithmetic in frontier LLMs. arXiv preprint arXiv:2402.14903.

Dimitris Spathis and Fahim Kawsar. 2023. The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models. arXiv preprint arXiv:2309.06236.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.
Gemma Team: Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. Gemma 3 technical report. Preprint, arXiv:2503.19786.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint.

Yuqing Wang and Yun Zhao. 2024. TRAM: Benchmarking temporal reasoning for large language models. In Findings of the Association for Computational Linguistics: ACL 2024.

Zhengxiang Wang and Zeyu Dong. 2026. Measuring iterative temporal reasoning with time puzzles. Preprint, arXiv:2601.07148.

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. 2024. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368.

Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, and Houfeng Wang. 2025. TIME: A multi-level benchmark for temporal reasoning of LLMs in real-world scenarios. In Advances in Neural Information Processing Systems. NeurIPS 2025.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. Qwen3 technical report. Preprint, arXiv:2505.09388.

Chenghao Zhu, Nuo Chen, Yufei Gao, Yunyi Zhang, Prayag Tiwari, and Benyou Wang. 2024. Evaluating LLMs at evaluating temporal generalization.
In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

A Appendix

A.1 Creation of our MULTITEMPBENCH

To ensure consistency across the multilingual benchmark, we implemented a unified processing pipeline. This pipeline processes the English source data and generates language-specific variants for Arabic, Chinese, Hausa, German, and English. The process consists of two stages: Standardization and Polymorphic Formatting.

A.1.1 Stage 1: Date Extraction and Standardization

The first step is identical for all five languages. We utilize a regular expression to identify date entities within the source text. Regardless of the input format, these dates are parsed into a standard internal representation (Year, Month, Day). This ensures that all downstream formatters operate on a consistent temporal grounding.

A.1.2 Stage 2: Polymorphic Formatting

Once standardized, the pipeline applies four distinct formatters per language: ISO, Slash (Numeric), Long (Textual), and Calendar (Phrasal/Cultural). The specific logic for each language is detailed below. The conversion process was implemented through a unified Python pipeline. For each language, the system first extracts and parses dates from the source English questions into a standard internal representation. Language-specific formatters are then applied. For instance, Arabic formatting involves converting digits to Arabic-Indic numerals, applying right-to-left marks for ISO dates, and using the hijri-converter library to generate Hijri calendar dates (e.g., a Hijri date rendered in Arabic script with Arabic-Indic numerals and Hijri month names). Similarly, Chinese formatting integrates conversions to the traditional lunar calendar.

Arabic Implementation. The Arabic formatting pipeline requires specific handling for text directionality and numeral systems.
• ISO Format: To prevent rendering issues in right-to-left (RTL) contexts, the standard ISO string is wrapped in Unicode Left-to-Right Marks (LRM, U+200E).

• Long Format: We map Gregorian month indices to their Arabic counterparts (e.g., July → Yuliyu) and convert Western Arabic numerals (0–9) to Eastern Arabic-Indic numerals (٠–٩).

• Calendar (Hijri) Format: We utilize the hijri-converter library to transform the Gregorian date into the Hijri calendar. The resulting day, month, and year are formatted using standard Hijri month names (e.g., Ramadan, Shawwal).

Chinese Implementation. Chinese formatting emphasizes the use of component suffixes and Lunar conversion.

• Long Format: Adheres to the standard East Asian order (Year-Month-Day) with the respective character suffixes (年, 月, 日).

• Calendar (Lunar) Format: We convert the Gregorian date to the Chinese Lunar calendar using the lunarcalendar library. The numeric years are converted to their Chinese character equivalents (e.g., 2023 → 二零二三), and months are mapped to their traditional lunar representations.

Hausa Implementation. Hausa formatting integrates Islamic cultural elements with standard Gregorian tracking, reflecting the region's dual-calendar usage.

• Long Format: Uses the particle "ga" (meaning "on") to connect the day and the month (e.g., 03 ga Afrilu 2023).

• Calendar Format: In this variant, we utilize the locally recognized Islamic month names (e.g., Ramadan, Shawwal) while maintaining the Gregorian year for clarity in civil contexts.

German Implementation. German requires specific grammatical phrasing for the "Calendar" variant to represent a formal date expression. While the standard formats use dot separators (D.M.Y), the calendar variant expands this to a formal phrase: "Am [Day]. [Month] des Jahres [Year]" (e.g., Am 26. Juni des Jahres 2025).

A.1.3 Examples

An example of our MULTITEMPBENCH is provided in Table 5.
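The Stage 2 formatters described above can be sketched with the standard library alone. The Hijri and Chinese Lunar variants, which the paper generates with the hijri-converter and lunarcalendar libraries, are omitted here; the function names are ours, not the paper's.

```python
# Sketch of "polymorphic formatting": one standardized (Year, Month, Day)
# date rendered in several surface variants. German Long/Calendar variants
# follow the phrasing given in the appendix.
from datetime import date

GERMAN_MONTHS = ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
                 "August", "September", "Oktober", "November", "Dezember"]

def fmt_iso(d: date) -> str:
    return d.isoformat()

def fmt_slash(d: date) -> str:
    # Numeric slash variant; D/M/Y order is an assumption for illustration.
    return f"{d.day:02d}/{d.month:02d}/{d.year}"

def fmt_long_de(d: date) -> str:
    return f"{d.day:02d}. {GERMAN_MONTHS[d.month - 1]} {d.year}"

def fmt_calendar_de(d: date) -> str:
    # Formal German phrasing used for the "Calendar" variant.
    return f"Am {d.day}. {GERMAN_MONTHS[d.month - 1]} des Jahres {d.year}"

d = date(2025, 6, 26)
print(fmt_iso(d))          # 2025-06-26
print(fmt_calendar_de(d))  # Am 26. Juni des Jahres 2025
```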
A.2 Validation of the Multilingual Date Fragmentation Ratio (mDFR)

This appendix provides a detailed account of the formulation and two-part validation process for our custom Multilingual Date Fragmentation Ratio (mDFR). We demonstrate that this metric aligns closely with human intuition regarding semantic disruption and relies on empirically sound weightings.

A.2.1 Metric Formulation

We calculate the structural divergence θ between the model's token count vector t and the semantic baseline vector b using cosine distance. This metric quantifies the deviation of the model's tokenisation from an ideal semantic segmentation. The final mDFR score, F ∈ [0, 1], is constructed as a weighted sum of four specific error components: whether semantic roots are split (1_split), whether delimiters are lost (1_delimiter), the increase in total token count (ΔN), and the distributional divergence (θ).

A.2.2 Human Evaluation of Fragmentation Severity

This study was designed to confirm that our F metric captures what humans perceive as semantic disruption in tokenized dates more effectively than general-purpose text similarity metrics.

Methodology. We recruited five computer science graduate students, who were familiar with NLP but blind to our hypotheses, to serve as annotators. We created a stimulus set of 100 tokenised date strings, stratified to represent a wide range of models, date formats, and fragmentation levels from our experiments. For each item, annotators were shown the original date and the list of sub-tokens, and asked to rate the "fragmentation severity" on a 5-point Likert scale, according to the following rubric:

• 1 (No Fragmentation): Tokens perfectly preserve the semantic components.

• 2 (Minor Fragmentation): Mostly preserved, with minor, non-ideal splits.

• 3 (Moderate Fragmentation): Core components are broken, making the structure harder to discern. Delimiters might be lost or numbers oddly grouped.
• 4 (High Fragmentation): Date split into many small pieces (e.g., single digits), though the original characters are easily reassembled.

• 5 (Severe Fragmentation): Tokenisation completely obscures the date's structure, often by adding non-numeric tokens or creating highly unintuitive groupings.

The human judgments were highly reliable, with a Krippendorff's Alpha for inter-annotator agreement of α = 0.81.

Results. We computed the Spearman's rank correlation coefficient (ρ) between the average human rating for each item and the scores from our F metric, BLEU, and character-level Edit Distance. As shown in Table 6, our F metric demonstrated a strong correlation with human ratings (ρ = 0.89), far exceeding general-purpose metrics like BLEU (ρ = 0.43).

| Metric | Correlation (ρ) |
|---|---|
| mDFR | 0.89 |
| DFR (Bhatia et al., 2025a) | 0.81 |
| BLEU Score | 0.43 |
| Character-Level Edit Distance | 0.29 |

Table 6: Spearman correlation (ρ) of metrics with human judgments of fragmentation severity for multilingual dates.

| Task | Raw | Fmt | Lng | Size | Example | GT |
|---|---|---|---|---|---|---|
| Arithmetic | 250 | 4 | 5 | 5,000 | In a movie, the tower took exactly 14 years to construct. They started in 2000-12-27. When was it ready? | 2014-12-27 |
| Time Zone | 250 | 4 | 5 | 5,000 | If it's 2 AM on 1352-03-02 in Asia/Singapore, what's the date and time in Europe/Athens? | 8 PM on 1352-03-01 |
| Relation | 250 | 4 | 5 | 5,000 | Rules for lending against stocks and unit trusts were also redefined. What is the relationship between the event 'redefined' and the time 'April 1, 1997'? | IS_INCLUDED |
| Total | 750 | 6 | 5 | 15,000 | | |

Table 5: Overview of tasks in the MULTITEMPBENCH dataset. "Raw" denotes unique English questions. "Size" is the total number of examples after multilingual/format expansion (Raw × 4 Fmt × 5 Lang). The GT column shows the expected answer format. We have 6 unique date formats.

A.2.3 Data-Driven Validation of Metric Coefficients

To directly tune our metric to align with human perception, we framed the weight determination as a linear regression problem. The goal was to predict the average human severity rating using the four fragmentation components as features: x = [1_split, 1_delimiter, (N − N_b), θ]. After fitting the model to our human evaluation data, we obtained a set of empirically derived coefficients. As shown in Table 7, the weights learned from human ratings are remarkably similar to the normalised version of our original, intuitively set weights. This confirms that distributional divergence (θ) is the dominant factor in perceived severity, followed by structural breaks, with token count inflation playing a minor role.

A.2.4 Qualitative Analysis of Fragmentation

To visualise how mDFR scores correspond to real-world model outputs, we analysed tokenisation patterns across different languages and scripts. Table 2 illustrates the correlation between high mDFR scores, human severity ratings, and severe segmentation issues. Notably, non-Latin scripts (e.g., Arabic, Chinese) and agglutinative languages often suffer from higher fragmentation (rated 4.6–5.0 by humans), where semantic roots are often shattered into single characters or bytes.

A.3 Correlation of the different tasks

The same broad pattern holds in the other two tasks, though the strength of the effect varies by task. In temporal relation extraction (Figure 6), higher fragmentation is associated with lower accuracy in the two low-resource languages, especially Hausa (r = −0.58), while the relationship remains weak or near-zero in English (r = 0.06), German (r = 0.28), and Chinese (r = 0.08). Arabic also shows a modest negative correlation (r = −0.29). As in date arithmetic, these results suggest that date fragmentation is more consequential in low-resource settings, whereas high-resource languages are generally more robust to fragmented temporal inputs. A similar but slightly stronger pattern appears in time zone conversion (Figure 7).
Fragmentation is negatively correlated with accuracy in Arabic (r = −0.54) and especially Hausa (r = −0.74), but remains weak in English (r = −0.15), German (r = −0.01), and Chinese (r = −0.13). Compared with temporal relation extraction, time zone conversion shows a clearer low-resource penalty, though still less extreme than the effect observed for date arithmetic. Overall, across all three tasks, date fragmentation is most predictive of failure in low-resource languages, supporting the view that tokenisation is a regime-dependent bottleneck rather than a universal explanation of temporal reasoning errors.

A.4 Human Evaluation Details

To validate the reliability of the LLM-based judging pipeline, we conducted a human evaluation on a subset of the benchmark. Six annotators participated in the study, all of whom were Master's students in computer science or closely related disciplines. For each language included in the validation set, at least two annotators independently reviewed the model outputs and determined whether the response should be classified as CORRECT, INCORRECT, or NOT_ATTEMPTED. The evaluation covered multiple languages present in the benchmark to ensure that linguistic diversity did not bias the assessment. Disagreements were resolved using majority voting across annotators. Across the evaluated instances, the human annotators achieved an average agreement rate of approximately 89%, indicating strong consistency in the evaluation criteria.

Fragmentation Component | Original Intuitive Weight | Empirically Learned Weight (Normalised, from Human Ratings)
1_split (Component Split) | 0.1818 | 0.2015
1_delimiter (Delimiter Loss) | 0.1818 | 0.1932
N − N_b (Token Difference) | 0.0909 | 0.1053
θ (Distributional Divergence) | 0.5455 | 0.5000

Table 7: Comparison of Original (Normalised) and Empirically Learned Weights for the F Metric.

Format | Language | Calendar | Original String | Baseline Tokenisation | Gemma 3 Tokenisation (Visualised) | mDFR | Avg. Human Rating
D. Month YYYY | German | Greg. | 10. Oktober 2034 | 10 . Oktober 2034 | 1|0|.|Oktober|2|0|3|4 | 0.50 | 4.2
Month D, YYYY | English | Greg. | October 10, 2034 | October 10 , 2034 | October|1|0|,|2|0|3|4 | 0.53 | 4.4
Y年M月D日 | Chinese | Greg. | 2034年10月10日 | 2034年10月10日 | 2|0|3|4|年|1|0|月|1|0|日 | 0.55 | 4.6
D Month Y | Arabic | Hijri | [Arabic script; garbled in extraction] | [garbled] | 2|7|…|1|4|5|6|… | 0.60 | 5.0
D Month Y AH | English | Hijri | 27 Rajab 1456 AH | 27 Rajab 1456 AH | 2|7|Raj|ab|1|4|5|6|AH | 0.60 | 5.0
D Month YYYY | Arabic | Greg. | [Arabic script; garbled in extraction] | [garbled] | 1|0|…|2|0|3|4 | 0.70 | 5.0
D Month YYYY | English | Greg. | 10 October 2034 | 10 October 2034 | 1|0|October|2|0|3|4 | 0.75 | 5.0
Month D, YYYY | Hausa | Greg. | Oktoba 10, 2034 | Oktoba 10 , 2034 | O|kt|oba|1|0|,|2|0|3|4 | 0.78 | 5.0

Table 8: Qualitative Analysis of Tokenisation Fragmentation. Vertical bars (|) denote token boundaries within the Gemma 3 tokeniser. The Avg. Human Rating (1–5 scale) confirms that higher mDFR scores correspond to perceived severe fragmentation. The Arabic source strings are rendered in Arabic script in the paper and were garbled during extraction; their token boundaries are shown with elided glyphs (…).

[Figure 6: per-language scatter plots of fragmentation ratio vs. temporal relation accuracy; Arabic r = −0.29, English r = 0.06, German r = 0.28, Hausa r = −0.58, Chinese r = 0.08.]
Figure 6: Impact of Tokenisation on Temporal Relation Task Accuracy. The scatter plots show the correlation between Date Fragmentation Ratio (DFR) and temporal reasoning accuracy for each language.

[Figure 7: per-language scatter plots of fragmentation ratio vs. time zone conversion accuracy; Arabic r = −0.54, English r = −0.15, German r = −0.01, Hausa r = −0.74, Chinese r = −0.13.]
Figure 7: Impact of Tokenisation on Time Zone Conversion Task Accuracy. The scatter plots show the correlation between Date Fragmentation Ratio (DFR) and temporal reasoning accuracy for each language.
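The per-language effects plotted in Figures 6 and 7 reduce to a standard Pearson r between each model's date fragmentation ratio and its task accuracy. A minimal, stdlib-only sketch; the data points below are hypothetical illustrations of a low-resource pattern, not values from the benchmark:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model points for one language:
# (fragmentation ratio, task accuracy in %)
frag = [0.15, 0.18, 0.22, 0.27, 0.30]
acc = [62.0, 58.0, 49.0, 41.0, 35.0]

# Accuracy falls as fragmentation rises, so r is strongly negative,
# mirroring the Hausa panels in Figures 6 and 7.
r = pearson_r(frag, acc)
```

In practice one would use `scipy.stats.pearsonr` (and `spearmanr` for the rank correlations in Table 6), which also returns a p-value; the hand-rolled version above only makes the computation explicit.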
This agreement level provides additional confidence in the reliability of the automated LLM-as-a-judge evaluation protocol used in our experiments.

A.5 Temporal Geometry

PCA visualization across layers. To provide a qualitative view of how temporal structure emerges across depth, we apply PCA to the set of points $\bar{h}^{(\ell)}_{y,i}$ for $y \in [1990, 2024]$ across the five languages. In the visualizations (Figure 8), the resulting plots display the sequence of line segments connecting consecutive years, revealing whether languages form coherent, linear paths in the embedding space.

A.6 LLM as judge Prompts

Our prompt for LLM-as-judge is illustrated in Table 10.

(a) Layer 0 (Input): Chaotic trajectories dominated by surface-level tokenisation fragmentation.
(b) Layer 7 (Early): Representations separate by language syntax; global linear organization has not yet formed.
(c) Layer 14 (Middle): High-resource languages (EN, DE, ZH) begin to straighten; low-resource (HA) remains curved.
(d) Layer 21 (Reasoning): The Geometric Language Tax. EN/DE/ZH form near-linear year trajectories useful for arithmetic, while HA remains a non-linear cluster.
(e) Layer 27 (Output): Final separation of language clusters to prepare for distinct lexical decoding.

Figure 8: Evolution of temporal organization across layers. PCA projections of year centroid embeddings (1990–2024) in Qwen 3. The plots show a progression from input-level fragmentation (Layer 0) to temporally structured, approximately linear trajectories in mid-to-deep layers for high-resource languages, while Hausa fails to linearize, remaining geometrically misaligned with the year-structured axis.

Human Evaluation

Context-based resolution
Prompt: Who was the chair of Allgemeiner Deutscher Fahrrad-Club in 17/10/2016?
Gold Answer: Ulrich Syberg
Model Prediction: As of October 17, 2016, the Federal Chairman was Ulrich Syberg
Human Annotator Rating: CORRECT
LLM-as-Judge Rating: CORRECT

Date arithmetic
Prompt: What date is 60 days after 05/01/1225?
Gold Answer: March 6, 1225; June 29, 1225
Model Prediction: July 30, 1225
Human Annotator Rating: INCORRECT
LLM-as-Judge Rating: INCORRECT

Table 9: Human evaluation of LLM-as-judge.

LLM-as-Judge Evaluation Prompt

Your task: Evaluate one prediction at a time. You receive:
• Question – the task prompt shown to the model
• Gold target – all answers that are considered correct
• Predicted answer – the model's response

Return one letter only:
A CORRECT – prediction fully matches one gold variant
B INCORRECT – prediction contradicts or misses required info
C NOT_ATTEMPTED – prediction refuses, guesses, or answers irrelevantly

General rules:
1. Match semantics; ignore capitalisation, punctuation, order.
2. If any statement contradicts the gold target, grade B.
3. Hedging ("I think...") is fine if the correct info is present and no incorrect info is added.
4. Partial answers are B. Typos that preserve meaning are allowed.

DateAugBench specifics:
• Date format ambiguity: gold lists every valid interpretation; accept any.
• Date arithmetic: prediction must match day, month, year of a listed variant; any textual format allowed.
• Format-switch questions: answer with any synonym of Yes/True or No/False.
• Numeric answers must match the gold number to the last shown significant digit.

Output format: Return exactly one capital letter: A or B or C. No additional text or punctuation.

Example template:
Question: {question}
Gold target: {target}
Predicted answer: {predicted_answer}
Now grade: A or B or C

Table 10: LLM-as-Judge prompt used for comparing model and gold answers in the three tasks in MULTITEMPBENCH.
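The judging protocol can be wrapped in a small harness that fills the prompt template and parses the judge's one-letter verdict. This is a sketch under our own assumptions: the function names, the ";"-joined gold variants, and the fallback to NOT_ATTEMPTED for unparseable output are ours, not part of the paper's released code:

```python
# Minimal sketch of a grading harness around an LLM-as-judge prompt.
# The actual judge-model call is omitted; only prompt assembly and
# verdict parsing are shown.

JUDGE_TEMPLATE = """Question: {question}
Gold target: {target}
Predicted answer: {predicted_answer}
Now grade: A or B or C"""

VERDICTS = {"A": "CORRECT", "B": "INCORRECT", "C": "NOT_ATTEMPTED"}

def build_prompt(question: str, gold_targets: list[str], prediction: str) -> str:
    # The gold target lists every acceptable variant; matching any one
    # counts as correct (e.g. both readings of an ambiguous date).
    return JUDGE_TEMPLATE.format(
        question=question,
        target=" ; ".join(gold_targets),
        predicted_answer=prediction,
    )

def parse_verdict(raw: str) -> str:
    """Map the judge's raw output to a label, tolerating stray whitespace.

    Anything that is not a recognised letter falls back to NOT_ATTEMPTED,
    matching the protocol's requirement of exactly one capital letter.
    """
    letter = raw.strip()[:1].upper()
    return VERDICTS.get(letter, "NOT_ATTEMPTED")
```

For the date-arithmetic example in Table 9, `build_prompt("What date is 60 days after 05/01/1225?", ["March 6, 1225", "June 29, 1225"], "July 30, 1225")` yields a prompt whose gold line lists both date-format interpretations, and a judge reply of "B" parses to INCORRECT.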