Paper deep dive
A Closer Look into LLMs for Table Understanding
Jia Wang, Chuanyu Qin, Mingyu Zheng, Qingyi Si, Peize Li, Zheng Lin
Abstract
Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on 4 dimensions: attention dynamics, effective layer depth, expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern -- early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.
Links
- Source: https://arxiv.org/abs/2603.15402v1
- Canonical: https://arxiv.org/abs/2603.15402v1
Full Text
A Closer Look into LLMs for Table Understanding

Jia Wang 1,2∗, Chuanyu Qin 1,2∗, Mingyu Zheng 1,2∗, Qingyi Si 3, Peize Li 1, Zheng Lin 1,2‡
1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3 JD.COM
wangjia,linzheng@iie.ac.cn

Abstract

Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on 4 dimensions: attention dynamics, effective layer depth, expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern: early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks. Our code and analysis pipeline are publicly available at https://github.com/JiaWang2001/closer-look-table-llm

1 Introduction

Tables, as a representative form of structured data, are widely used across various real-world fields to store and present information.
In recent years, Large Language Models (LLMs) have shown powerful instruction-following and complex reasoning abilities, and have thus become the dominant paradigm for table understanding, supporting a wide range of downstream application scenarios such as table question answering (TQA) (Pasupat and Liang, 2015; Zhang et al., 2024; Deng and Mihalcea, 2025), table fact verification (TFV) (Chen et al., 2020; Wang et al., 2024b), advanced data analysis (Li et al., 2025; Weng et al., 2025; OpenAI, 2024) and spreadsheet manipulation (Li et al., 2023; Microsoft, 2025). (∗ Indicates equal contribution. ‡ Corresponding author: Zheng Lin.)

Although remarkable success has been achieved with existing LLM-based table understanding approaches, they predominantly focus on continually improving performance metrics across downstream tasks, e.g., boosting LLMs' performance with the most suitable prompt designs (Sui et al., 2024; Ye et al., 2023) and developing specialist tabular LLMs with supervised fine-tuning (SFT) (Deng and Mihalcea, 2025; Zhang et al., 2024, 2025) or reinforcement learning (RL) (Yang et al., 2025b; Wu et al., 2025; Kang et al., 2025b).

Unlike previous performance-oriented work, in this paper we conduct an in-depth empirical study to achieve a deeper understanding of the underlying mechanisms of LLMs' table understanding. Our experiments cover a wide spectrum of LLMs of different sizes and types, and focus on four perspectives: attention dynamics, effective layer depth, the impact of MoE architectures, and the influence of input designs. The explored research questions are shown below.

Q1: How do intra-table and extra-table attention patterns of LLMs evolve when performing table understanding tasks?

Q2: How many effective layers are utilized by LLMs to achieve stable predictions, and are there differences between tabular and general tasks?
Q3: Are there specialized table experts in MoE models, and where are they located?

Q4: How do different table formats and reasoning strategies influence internal representations?

We conclude our key findings as follows:

(1) Different LLM layers exhibit distinct attention patterns during table understanding. The early layers broadly scan the overall table content, the middle layers then concentrate attention on query-related cells, and the top layers further amplify the focused content representations to produce the final answer.

(2) Tabular tasks require more layers for output distribution refinement than math reasoning. While the depth at which final answer content crystallizes is broadly comparable across tasks, LLMs refine output distributions across more layers when processing tabular data, a pattern consistent across model scales and training strategies.

(3) MoE models develop specialized experts for tabular tasks in the middle layers, which show minimal overlap with math-oriented experts, but early and top layers share general-purpose experts across different tasks.

(4) Input formats and reasoning strategies indeed influence attention patterns. HTML tables result in more dispersed attention distributions in early layers than Markdown tables. Chain-of-Thought (CoT) prompting can allocate more attention to the table content, which can be further amplified by table-specific fine-tuning and thus leads to better performance.

To the best of our knowledge, we present the first thorough investigation of the mechanisms underlying LLM-based table understanding. Our findings not only provide new insights into LLM interpretability in the tabular domain, but also offer guidelines for future research: from optimal input configurations that maximize attention to relevant table content, to inference-time interventions in later layers, to MoE optimizations for table-specialized experts.

arXiv:2603.15402v1 [cs.CL] 16 Mar 2026
2 Related Work

2.1 LLM-based Table Understanding

Table understanding (TU) techniques aim to enable models to automatically comprehend tables and perform various tabular tasks based on user requests (Shigarov, 2023; Lu et al., 2025). With the rapid iteration of LLMs, their abilities have opened new possibilities for more intelligent TU applications. One line of research endeavours to enhance LLMs' table understanding ability through different strategies, for instance, eliciting correct reasoning steps with prompt engineering and carefully selected in-context examples (Chen, 2023; Ye et al., 2023; Wang et al., 2024b; Ji et al., 2024; Jiang et al., 2023), collecting table instruction-tuning data for supervised fine-tuning (Zhang et al., 2024, 2025), exploring new tasks and reward strategies for reinforcement learning (Yang et al., 2025b; Wu et al., 2025; Zhou et al., 2025a), and building powerful table agents for multi-turn data analysis and Excel manipulation (Li et al., 2023; Weng et al., 2025).

In addition to the performance-oriented studies mentioned above, another valuable direction is to explore the robustness and interpretability of LLMs' table-related capabilities, for example, evaluating models' performance against divergent perturbations such as table formats (Sui et al., 2024), noisy operators (Singha et al., 2023) and cell replacement (Bhandari et al., 2025). Compared with existing work that mainly focuses on performance robustness, we aim at a better understanding of the underlying mechanisms of LLM-based table understanding and provide valuable insights for future follow-ups.

2.2 Interpretability and Analysis of LLMs

Despite the strong capabilities of LLMs, understanding their internal mechanisms remains a key challenge.
Some studies focus on the roles of different layers in LLMs, examining how individual layers contribute to information processing (Skean et al., 2025) and how semantic representations are gradually constructed across layers (Csordás et al., 2025; Hu et al., 2025). Other lines of work investigate specific internal components, such as attention heads in self-attention mechanisms and feed-forward network (FFN) modules, with the aim of clarifying their roles in knowledge representation (Geva et al., 2021; Meng et al., 2022) and information acquisition (Wu et al., 2024; Kobayashi et al., 2023), revealing various intriguing phenomena within LLMs, such as attention sink (Kobayashi et al., 2020; Xiao et al., 2023; Gu et al., 2024; Qiu et al., 2025; Queipo-de Llano et al., 2025). Furthermore, with the growing adoption of Mixture-of-Experts (MoE) architectures, recent research (Wang et al., 2024a; Bai et al., 2025; Lo et al., 2025; Bandarkar et al., 2025; Su et al., 2025) analyzes expert modules within MoE models, exploring their specialization patterns and activation behaviors during task execution.

While these efforts have provided valuable insights (Si et al., 2023; Bavaresco et al., 2025), they predominantly focus on unstructured text tasks such as math reasoning and factual recall. In this work, we extend interpretability research to structured table data, conducting analyses to explore how LLMs process tables and offering preliminary findings across several dimensions.

Table 1: Summary of 2,000 evaluation samples used in our empirical study.
| Dataset | Task | Structure | Avg Tokens | #Samples |
| --- | --- | --- | --- | --- |
| WTQ | TQA | Flat | 1012 | 500 |
| HiTab | TQA | Hierarchical | 964 | 500 |
| AIT-QA | TQA | Hierarchical | 651 | 500 |
| TabFact | TFV | Flat | 681 | 500 |

3 Experimental Setups

Evaluation Data. We randomly select 500 samples each from 3 TQA benchmarks and 1 TFV benchmark to perform the empirical study: WTQ (Pasupat and Liang, 2015), HiTab (Cheng et al., 2021), AIT-QA (Katsis et al., 2021) and TabFact (Chen et al., 2020). The resulting 2,000 test samples cover common table structures with flat headers (WTQ, TabFact) and complex hierarchical headers (HiTab, AIT-QA). The input tables are serialized into Markdown format by default, and we also analyze the impact of alternative formats (e.g., HTML) in Section 7.1. The complete input prompt templates are shown in Appendix C.1.

Analyzed LLMs. We analyze 16 LLMs of three categories: (1) general-purpose instruct LLMs such as Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct; (2) fine-tuned tabular LLMs via SFT or RL, such as TAMA (Deng and Mihalcea, 2025) and Table-R1-Zero (Yang et al., 2025b); (3) Mixture-of-Experts (MoE) models like DeepSeek-V2-Lite (DeepSeek-AI et al., 2024) and Qwen3-30B-A3B (Yang et al., 2025a). The complete model list is provided in Appendix C.3.

4 Analysis of Attention Dynamics

To understand how LLMs allocate attention during table question answering, we design a controlled experimental setup where each input instance comprises three segments: (1) a system prompt providing general instructions, (2) the table content containing the serialized table, and (3) a user question specifying the query. This segmentation allows us to trace attention flow across different input components. We investigate three progressive questions: §4.1: How does the model attend to the three input segments, especially different table cells? §4.2: How much does the table content actually contribute to the model's final output? §4.3: Does this attention pattern causally affect model predictions?
The following metrics are employed in our analysis: (1) segment attention ratio measures the proportion of attention allocated to each segment per layer; (2) table attention entropy quantifies how concentrated the attention is within the table (lower entropy indicates a more focused attention distribution on specific cells); and (3) attention contribution captures the actual influence of each segment on the final output via the L2 norm of value-weighted representations. Formal definitions and formulas of these metrics are provided in Appendix B. For all metrics, we first average across all generated tokens of each sample, and then average across all 2,000 test samples. The attention dynamics of Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct are shown in Figure 1. Results of more models are shown in Appendix F.

4.1 How Does the Model Attend to the Table?

Takeaway 1: Although the overall attention ratio to table content is modest compared to other segments, LLMs can precisely focus on query-relevant cells in the middle layers, suggesting that accurate localization matters more than total attention volume.

Allocated attention ratio across input segments. As shown in Figure 1 (lower left), the attention allocated to <table content> is highest in the earliest layers, then drops before rising again to a secondary peak in the lower-middle layers (layers 10-15), and finally decreases to a stable level in the top layers (layers 20-30). Notably, the overall changes of table-oriented attention across different layers are modest. We observe a similar trend for the attention allocated to <user question>, with both reaching their peak in the middle layers, which may suggest that the model is integrating question information to interpret table structures and identify critical table content.
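The three metrics above can be sketched in a few lines of Python. This is an illustrative toy implementation operating on plain lists rather than real attention tensors; the paper's formal definitions are in its Appendix B, and all function and variable names here are assumptions:

```python
import math

def segment_attention_ratio(attn, segments):
    """Fraction of one generated token's attention mass falling on each
    input segment. `attn` is a head-averaged row of attention weights
    over input positions; `segments` maps segment name -> positions."""
    total = sum(attn)
    return {name: sum(attn[i] for i in idx) / total
            for name, idx in segments.items()}

def table_attention_entropy(attn, table_idx):
    """Shannon entropy of the attention distribution renormalized over
    table tokens only; lower entropy = more focus on specific cells."""
    mass = [attn[i] for i in table_idx]
    z = sum(mass)
    return -sum((m / z) * math.log(m / z) for m in mass if m > 0)

def attention_contribution(attn, values, seg_idx):
    """L2 norm of the value-weighted sum a segment injects into the
    residual stream: || sum_{i in seg} attn_i * v_i ||_2."""
    dim = len(values[0])
    acc = [sum(attn[i] * values[i][d] for i in seg_idx) for d in range(dim)]
    return math.sqrt(sum(x * x for x in acc))
```

On a toy row, a segment can take a large share of raw attention yet contribute little once its value vectors are small, which is exactly the attention-sink distinction the paper draws between the system prompt and the table.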
Outside the middle layers, the <system prompt> consistently attracts higher attention scores than the user question and table content, which could be attributed to the attention sink phenomenon observed in prior work (Xiao et al., 2023; Gu et al., 2024; Kang et al., 2025a). The step-wise segment attention ratio at specific layers throughout the token-by-token generation process of one sample is also visualized in the upper left sub-panels of Figure 1, which provides evidence of the observed patterns.

[Figure 1 omitted: case study on the question "which nation won the most gold medals?" (answer: China) for Meta-Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct, with per-layer attention-ratio dynamics, cell-level attention heatmaps, and layer-wise segment attention ratio, table attention entropy, and attention contribution curves.]

Figure 1: The attention dynamics of Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct on tabular tasks. Each input consists of three segments: system prompt, table content, and user question. Upper: a case study at selected layers; left sub-panels show the step-wise attention ratio trend throughout generation steps; right sub-panels show cell-level attention heatmaps. Lower: aggregated results; Layer-wise Segment Attn Ratio shows the proportion of attention allocated to each segment per layer; Table Attn Entropy measures the degree of focus towards table cells (lower entropy indicating a more concentrated attention distribution on specific cells); Attn Contribution measures the influence of each segment on the final output. Notably, the entropy minima at Layer 10 and Layer 14 for Llama3.1-8B-Instruct align with the concentrated attention on the answer cell "China" in the upper heatmap.

The influence of model scales. Comparing models of different scales within the same family (detailed results in Appendix F), we find that overall attention distribution trends remain consistent. However, two notable differences emerge: (1) the secondary peak of table attention shifts toward deeper layers as model size increases, and (2) the layer range with low table attention entropy becomes broader in larger models, suggesting they maintain focused attention across more layers.

The influence of model architectures. Comparing Llama3.1-8B and Qwen2.5-7B (Figure 1), we observe that Qwen allocates higher attention to table content in early layers. MoE models exhibit a more gradual decline in table attention entropy, with a distinctive three-phase pattern: attention first covers partial cells, then spreads across the entire table, and finally converges to answer-relevant cells (detailed in §6). For tabular LLMs like TableGPT2 (Su et al., 2024), table-specific fine-tuning leads to higher table attention and weaker attention sink effects, though overall patterns remain similar to their base models.
Attention distribution within the table. While the overall attention allocated to the table is limited, does the model at least focus on the query-related table cells? To answer this, we measure the table attention entropy over table tokens at each layer. As shown in Figure 1 (lower middle), the table attention entropy follows a U-shaped pattern: it decreases from the early layers, reaches a minimum in the middle layers, and finally increases again. The attention heatmaps in Figure 1 (upper right sub-panels) also illustrate this pattern. In early layers (e.g., Layer 0), attention is spread broadly across the entire table, suggesting that the model scans all content. In middle layers (e.g., Layer 14 for Llama-3.1-8B-Instruct), attention becomes concentrated on specific cells; for the question "which nation won the most gold medals?", the model focuses on the "China" cell and the "Gold" column header, which directly correspond to the answer. In later layers, attention spreads slightly again as the model prepares to generate the answer. This pattern is consistent across different models and benchmarks. We provide more case studies and model comparisons in Appendix F.

These findings suggest that LLMs do not require heavy attention to all table content. Instead, they selectively focus on query-relevant cells, much like humans skimming a table for key information. This implies that low-attention table regions could be pruned to develop efficient tabular LLMs with little performance loss.

4.2 How Much Does the Table Content Contribute to the Final Output?

Takeaway 2: Despite receiving modest attention, table content makes substantial contributions to the model's output, with its influence increasing in later layers and ultimately driving answer generation.

Attention weights show where the model "looks", but not how much the attended content actually influences the output.
As noted by prior work (Kobayashi et al., 2020; Gu et al., 2024), the true impact depends on both the attention weight and the transformed value vector. Therefore, we measure the value-weighted contribution: the L2 norm of each segment's representation injected into the transformer residual stream (see Appendix B for a formal definition).

[Figure 2 omitted: bar chart of accuracy drops for Llama-3.1-8B (Instruct), Qwen2.5-7B (Instruct), TAMA (Llama-based), Table-R1-7B (Qwen-based), and Table-R1-8B (Llama-based) under Shallow (1–L/4), S-M (L/4–L/2), M-D (L/2–3L/4), and Deep (3L/4–L) masking.]

Figure 2: Performance drop when masking table attention in different layer ranges across 5 models (averaged over 3 benchmarks × 2 formats). Llama-family models show minimal impact from Deep masking, while Qwen-family models degrade uniformly across all ranges.

As shown in Figure 1 (lower right), despite the modest attention ratio observed in §4.1, the table segment makes substantial contributions that increase in later layers, peaking near the final layers. In contrast, the system prompt's contribution remains low across all layers despite its high attention ratio, confirming its role as an attention sink rather than an information source.

This finding connects the previous observations into a coherent narrative of the model's internal workflow. While early layers broadly encode the table (high ratio, high entropy), middle layers act as a critical reasoning phase characterized by "concentrated influence": the model narrows focus to relevant cells (low entropy) and amplifies their impact on the residual stream (rising norm). Finally, in late layers, the question and the focused table content jointly drive answer generation.

4.3 Does the Attention Pattern Causally Affect Output?
Takeaway 3: The three-phase attention pattern is functionally essential rather than a mere processing correlate: masking table attention in early-to-middle layers causes severe performance degradation, and middle-layer attention focus on answer cells is predictive of output correctness.

Causal validation via attention masking. We zero out attention weights to all table tokens within four consecutive quarter-layer blocks: Shallow (1–L/4), S-M (L/4–L/2), M-D (L/2–3L/4), and Deep (3L/4–L). We run this intervention across 5 models, 3 benchmarks, and 2 formats. As shown in Figure 2, masking early layers (Shallow, S-M) universally causes the largest performance drops (44–65 points), directly demonstrating that the broad scanning phase is functionally essential to downstream prediction. Llama-family models (Llama-3.1-8B, TAMA, Table-R1-8B) show the most severe degradation from Shallow and S-M masking with minimal late-layer impact (TAMA drops by 58.8 and 56.3 points for Shallow and S-M, respectively, but only 6.5 points for Deep), empirically validating our interpretation that late layers serve an amplification rather than a retrieval role. In contrast, Qwen-family models (Qwen2.5-7B, Table-R1-7B) show uniformly severe degradation across all masking ranges, reflecting their more distributed attention profiles (§4.1).

Attention-answer correspondence. To complement the causal analysis, we select 200 lookup-type examples and identify the Top-20 attention-weighted cells in low-entropy layers (layers 10–15) for Llama-3.1-8B and Qwen2.5-7B. Cells focused on by low-entropy layers directly correspond to the model's final response in 53.0% (Llama) and 43.9% (Qwen) of cases. When low-entropy layers successfully attend to the correct answer cell, final prediction accuracy reaches 65.6% (Llama) and 71.2% (Qwen). Conversely, when models answer incorrectly, 52.8% (Llama) and 74.4% (Qwen) of errors are associated with low-entropy layers failing to attend to the correct cell.
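The masking intervention can be sketched as follows. This is a minimal illustration assuming head-averaged attention rows, integer-division block boundaries, and renormalization after masking; it is not the authors' actual hook code:

```python
def quarter_layer_block(num_layers, block):
    """Layer indices for one of the four masking ranges: 'shallow'
    (first quarter), 's-m', 'm-d', or 'deep' (last quarter)."""
    q = num_layers // 4
    bounds = {"shallow": (0, q), "s-m": (q, 2 * q),
              "m-d": (2 * q, 3 * q), "deep": (3 * q, num_layers)}
    lo, hi = bounds[block]
    return list(range(lo, hi))

def mask_table_attention(attn_row, table_idx):
    """Zero the attention a query token pays to table positions and
    renormalize the row, emulating the per-layer intervention."""
    table = set(table_idx)
    masked = [0.0 if i in table else w for i, w in enumerate(attn_row)]
    z = sum(masked)
    return [w / z for w in masked] if z > 0 else masked
```

In a real experiment this masking would be applied inside the attention module of every layer in the chosen block before the value aggregation, while all other layers run unchanged.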
This suggests that misfocused middle-layer attention is a frequent predictor of incorrect predictions, pointing to potential utility for attention-based error detection.

5 Effective Depth for Tabular Tasks

The previous analysis revealed that middle layers concentrate attention on query-relevant cells while later layers amplify their contributions. A natural follow-up question is: how many layers do LLMs actually need to complete table understanding tasks? To investigate this, we apply LogitLens (Nostalgebraist, 2020), which decodes hidden representations at each layer into vocabulary distributions. We measure the KL divergence between each layer's prediction and the final output, along with the top-5 token overlap.

[Figure 3 omitted: per-layer KL divergence bars and top-5 overlap curves for Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, OLMoE-1B-7B-Instruct, and Qwen3-30B-A3B-Instruct.]

Figure 3: Prediction stability analysis using LogitLens. Bars show KL divergence between each layer's decoded distribution and the final output (lower = closer to final prediction). Lines show top-5 token overlap with the final output. Vertical dashed lines mark where predictions stabilize.

Takeaway 4: Table tasks engage more layers for output distribution refinement than math reasoning, though the depth at which answer content crystallizes is comparable across tasks. This pattern is consistent across model scales and fine-tuning strategies, suggesting that layer-wise functionality is largely determined during pre-training.

As shown in Figure 3, KL divergence remains relatively high throughout the middle layers and only drops sharply in the final layers.
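A LogitLens-style probe is straightforward to sketch. This toy version (assumed shapes and names, not the original implementation) decodes each layer's hidden state with the unembedding matrix and measures KL divergence against the final layer's distribution:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def logit_lens_kl(hidden_states, unembed):
    """For each layer's hidden state h, decode p_layer = softmax(U @ h)
    and return KL(p_final || p_layer); near-zero KL at a layer means
    the output distribution has already stabilized there."""
    def decode(h):
        return softmax([sum(w * x for w, x in zip(row, h)) for row in unembed])
    final = decode(hidden_states[-1])
    return [sum(p * math.log(p / q) for p, q in zip(final, decode(h)) if p > 0)
            for h in hidden_states]
```

In practice the hidden states per layer come from the model's forward pass (and are typically passed through the final layer norm before unembedding); the per-layer KL values then directly produce plots like Figure 3.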
To directly validate whether this pattern distinguishes table tasks from other reasoning tasks, we run the same LogitLens pipeline on GSM8K across 7 representative models spanning general instruction-tuned, table-specific fine-tuned, and MoE categories. We decompose stabilization depth into two complementary metrics: KL-based depth (where the output distribution converges) and Top-k overlap depth (where the final answer content crystallizes).

As shown in Table 2, KL-based stabilization depth is consistently higher for table tasks than math across the majority of models (5 of 7, marked with †), indicating that LLMs continue refining their output distributions across more layers when processing tabular data. Top-k overlap depth, by contrast, is broadly comparable across both task types, suggesting that while the distribution refinement process engages more layers for tables, the point at which final answer content crystallizes is similar.

Table 2: Stabilization depth comparison between table and math (GSM8K) tasks. Values are reported as effective layer / total layers (ratio). KL-based depth is higher for table tasks in 5 of 7 models (†), while Top-k overlap depth is broadly comparable.

| Model | Layers | KL Depth (Table) | KL Depth (Math) | Overlap Depth (Table) | Overlap Depth (Math) |
| --- | --- | --- | --- | --- | --- |
| Instruct LLMs | | | | | |
| Qwen2.5-3B | 36 | 32 (0.89)† | 27 (0.75) | 33 (0.92) | 31 (0.86) |
| Qwen2.5-7B | 28 | 25 (0.89)† | 23 (0.82) | 25 (0.89) | 25 (0.89) |
| Qwen2.5-14B | 48 | 44 (0.92)† | 43 (0.90) | 44 (0.92) | 43 (0.90) |
| Llama-3.1-8B | 32 | 26 (0.81) | 29 (0.91) | 25 (0.78) | 29 (0.91) |
| Table-Specific Fine-tuned LLMs | | | | | |
| TAMA | 32 | 26 (0.81)† | 23 (0.72) | 25 (0.78) | 26 (0.81) |
| Table-R1-7B | 28 | 25 (0.89)† | 24 (0.86) | 25 (0.89) | 25 (0.89) |
| MoE LLMs | | | | | |
| Qwen3-30B-A3B | 48 | 43 (0.90) | 43 (0.90) | 45 (0.94) | 43 (0.90) |

This decomposition provides a more precise characterization: table understanding does not simply "take longer" uniformly, but specifically requires more layers for the fine-grained distribution adjustments that follow answer identification.
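Given such per-layer traces, the two depth metrics could be approximated as below. The convergence thresholds `eps` and `tau` are hypothetical placeholders; the paper's exact stabilization criterion is not reproduced here:

```python
def kl_depth(kl_per_layer, eps=0.1):
    """First layer from which KL to the final output stays below `eps`
    for all remaining layers (output distribution convergence)."""
    n = len(kl_per_layer)
    for layer in range(n):
        if all(kl < eps for kl in kl_per_layer[layer:]):
            return layer
    return n - 1

def overlap_depth(overlap_per_layer, tau=1.0):
    """First layer from which top-k token overlap with the final output
    stays at or above `tau` (answer-content crystallization)."""
    n = len(overlap_per_layer)
    for layer in range(n):
        if all(o >= tau for o in overlap_per_layer[layer:]):
            return layer
    return n - 1
```

Requiring the condition to hold for all remaining layers guards against transient dips, mirroring the intuition of the dashed stabilization lines in Figure 3.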
Notably, this pattern holds consistently across different model scales (7B to 32B) and training strategies (SFT and RL), suggesting that the layer-wise functionality for table understanding is largely established during pre-training (detailed results in Appendix D.2). This finding has practical implications: unlike math reasoning, where early-exit strategies can reduce computation (Csordás et al., 2025; Hu et al., 2025), table tasks may benefit less from such optimizations. Conversely, it suggests that targeted interventions in the later layers (e.g., inference-time steering) could be particularly effective for improving table understanding capabilities.

6 Seeking Table Experts of MoE Models

Takeaway 5: MoE models activate a distinct set of table-specific experts concentrated in the middle layers, mirroring the attention patterns observed in RQ1. These experts show minimal overlap with math-related experts.

Mixture-of-Experts (MoE) models route each token to a subset of experts based on learned gating functions. A natural question arises: do MoE models develop specialized experts for tabular tasks? To identify table-specific experts, we record the activation frequency of each expert across all table tokens and layers, following prior work (Wang et al., 2024a; Bai et al., 2025). As shown in Figure 4 (left), table-relevant experts are distributed across all layers, but their activation patterns vary by depth. The middle column of Figure 4 reveals a striking pattern: the entropy of the expert activation distribution reaches its minimum in the middle layers across all three MoE architectures. Lower entropy indicates that the model concentrates its routing on fewer, more specialized experts. This pattern mirrors the attention entropy findings in RQ1, where attention within the table also becomes most concentrated in the middle layers.
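The expert-frequency analysis can be sketched as follows, assuming we have already logged which expert id the router selected for each table token at a given layer (a simplification of top-k routing, where several experts fire per token):

```python
import math
from collections import Counter

def expert_entropy(routed_experts):
    """Entropy of the expert-activation frequency distribution at one
    layer, where `routed_experts` lists the expert id chosen for each
    table token; lower entropy = routing concentrated on fewer experts."""
    n = len(routed_experts)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(routed_experts).values())

def expert_overlap(table_routing, math_routing, top_k=4):
    """Number of experts shared between the top-k most-activated experts
    on the table task and on the math task."""
    top = lambda r: {e for e, _ in Counter(r).most_common(top_k)}
    return len(top(table_routing) & top(math_routing))
```

Computed per layer, the first function yields the entropy curves in the middle column of Figure 4, and the second yields the table/math overlap counts in the right column.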
Furthermore, these table-specific experts in the middle layers show minimal overlap with the experts activated for math reasoning tasks (Figure 4, right), confirming their domain-specific nature. Together, these findings suggest that MoE models achieve efficient multi-task processing by sharing general-purpose experts in early and late layers while routing to specialized experts in the middle layers for domain-specific reasoning.

7 Impact of Input Formats and Reasoning Strategies

The previous sections analyzed LLMs' internal mechanisms under a fixed setting (Markdown format, direct answering). In practice, however, tables can be represented in various formats (e.g., Markdown, HTML), and models can be prompted with different reasoning strategies (e.g., direct answering, Chain-of-Thought). Here we examine how these choices influence internal processing and model behavior.

7.1 How Does Table Format Affect Attention?

Takeaway 6: HTML tables lead to higher attention entropy than Markdown tables. However, internal representations of different formats converge in deeper layers.

To investigate how table format affects internal processing, we convert our analysis data from Markdown to HTML format and compare the attention patterns. As shown in Figure 5, LLMs assign a slightly higher attention ratio to HTML tables, but the attention entropy is significantly higher.
[Figure 4: Expert activation analysis in MoE models across three architectures (DeepSeek-V2, Qwen3-30B-A3B, OLMoE). Left: Activation heatmaps showing table-specific experts per layer (darker = higher activation probability). Middle: Entropy of expert activation distribution for table tasks (lower entropy indicates concentrated activation on fewer experts). Right: Number of overlapping experts between table and math (GSM8K) tasks per layer.]

[Figure 5: Comparison between Markdown and HTML table formats: layer-wise table attention entropy, table attention ratio, and t-SNE visualizations of hidden states at early, middle, and late layers for LLaMA-3.1-8B-Ins and Qwen-2.5-7B-Ins.]

This may be because the verbose structural syntax in HTML (e.g., <tr>, <td> tags) disperses the
model's attention across more tokens, making it harder to concentrate on specific cells. Despite these differences in early layers, t-SNE visualizations reveal that the internal representations of different formats gradually converge as layer depth increases. This suggests that while early layers process format-specific features, deeper layers extract format-agnostic semantic information. This convergence may explain why LLMs can achieve reasonable performance across different table formats. As shown in Table 3, the performance gap between Markdown and HTML formats is minimal across all models (typically within 1-2%), empirically confirming that LLMs can effectively handle both formats despite their different syntactic structures.

7.2 How Does Reasoning Strategy Affect Attention?

Takeaway 6: Chain-of-Thought (CoT) reasoning leads to higher attention to table content in middle and late layers compared to direct answering, an effect that persists even when analyzing only final answer tokens. Table-specific fine-tuning further amplifies this effect.

We compare the attention patterns under direct answering and Chain-of-Thought (CoT) prompting for both vanilla LLMs and their table-tuned variants.
As shown in Figure 6 (bottom), CoT reasoning leads to a higher proportion of attention allocated to table content, especially in the middle and late layers. Comparing vanilla and table-tuned models (Figure 6, top), table-specific fine-tuning generally increases attention to tables across all layers, with RL-tuned models showing the largest increase when combined with CoT. The performance results in Table 3 corroborate this: CoT consistently improves accuracy across all models, with particularly striking gains for table-tuned models (e.g., Table-R1-7B improves from 69.75% to 82.30%).¹

[Figure 6: Comparison between direct answering and Chain-of-Thought (CoT) reasoning: layer-wise Table Attn Ratio for LLaMA-based (8B) and Qwen-based (7B) models under Direct and CoT, plus the per-layer gain (CoT - Direct).]

Table 3: Average performance over 2,000 test samples under different input formats and inference strategies.

| Model | Markdown Vanilla | Markdown CoT | HTML Vanilla | HTML CoT |
|---|---|---|---|---|
| Instruct LLMs | | | | |
| Qwen2.5-7B | 62.15 | 67.20 | 61.35 | 67.80 |
| Llama3.1-8B | 56.95 | 62.35 | 57.95 | 58.35 |
| Table-Specific Fine-tuned LLMs | | | | |
| Table-R1-7B | 69.75 | 82.30 | 70.30 | 81.60 |
| Table-R1-8B | 70.85 | 82.25 | 69.90 | 80.60 |
| MoE LLMs | | | | |
| Qwen3-30B-A3B | 66.30 | 70.15 | 68.35 | 72.10 |

¹ All CoT vs. Direct differences are statistically significant (non-overlapping 95% bootstrap CIs, 1,000 iterations, N=2,000). Markdown vs. HTML differences fall within the margin of uncertainty for most models. Complete results with confidence intervals are provided in Appendix D.1.

To further isolate CoT's effect on table grounding from the natural table references within reasoning chains, we recompute the Table Attn Ratio using only the attention weights at final answer tokens, excluding all reasoning chain tokens. In early layers—layer 1 for LLaMA-3.1-8B (+1.85%) and layers 1–4 for Qwen2.5-7B (+13.56%, +9.12%, +1.52%, +7.45%)—CoT maintains a substantially higher Table Attn Ratio than direct answering even under this restricted setting, confirming that CoT genuinely induces stronger table grounding at the moment of answer generation. In middle-to-late layers, the final-answer-only ratio under CoT falls slightly below that of direct answering, suggesting a division of labor: the reasoning chain has already distilled relevant table information into the residual stream, allowing later layers to attend to the nearby reasoning trajectory rather than re-attending to distant table tokens.

8 Conclusion

We present the first systematic empirical study on the internal mechanisms of LLM-based table understanding across 16 LLMs. Our analysis reveals a coherent three-phase workflow: early layers broadly encode the table, middle layers precisely localize query-relevant cells, and late layers amplify the focused content for answer generation. This pattern manifests consistently across architectures—MoE models mirror this through table-specific expert activation in middle layers, while CoT reasoning and table-specific fine-tuning further enhance table engagement. These interconnected findings offer actionable guidance: from optimal input configurations, to inference-time interventions in later layers, to targeted optimization of table-specialized experts.

9 Limitations

Though this paper presents a systematic empirical study of the internal mechanisms of LLM-based table understanding, there are certain limitations and promising directions that deserve future research. (1) Analyzing LLMs of larger scales and different series.
Due to resource limitations, we mainly analyze advanced open-source models with parameters below 32B. Models of larger scales, such as DeepSeek-V3-671B and Qwen3-235B-A22B, also deserve further investigation to analyze the scaling effect on table understanding. (2) Extending to MLLMs. Previous work and recent proprietary models have demonstrated that multimodal LLMs (MLLMs) also possess strong table understanding ability based on table images (Zheng et al., 2024; Deng et al., 2024; Zhou et al., 2025b). As a result, it is worthwhile to extend our analysis to the MLLM scope to discover unique characteristics of multimodal table understanding. (3) Analyzing more tabular tasks. This study primarily focuses on table question answering and table fact verification, the two most typical tasks in existing studies. However, it is pertinent to investigate whether our findings can generalize to more complex tabular tasks, such as data analysis and table summarization. (4) Enriching table formats and reasoning strategies. We do not intend to exhaust every possible table format (such as CSV and JSON) or every reasoning strategy (such as program-of-thoughts with in-context examples), which could be further explored by future follow-ups.

10 Ethical Considerations

Our empirical study and all experiments are conducted on 4 public academic benchmarks, which are free and open-source data for research use. The analyzed LLMs are also open-source models that we downloaded from their official websites. As a result, the authors foresee no ethical concerns.

References

Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, and Zilong Zheng. 2025. Understanding and leveraging the expert specialization of context faithfulness in mixture-of-experts LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21938–21953, Suzhou, China. Association for Computational Linguistics.
Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, and Nanyun Peng. 2025. Multilingual routing in mixture-of-experts. arXiv preprint arXiv:2510.04694.

Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 238–255, Vienna, Austria. Association for Computational Linguistics.

Kushal Raj Bhandari, Sixue Xing, Soham Dan, and Jianxi Gao. 2025. Exploring the robustness of language models for tabular question answering via attention analysis. Preprint, arXiv:2406.12719.

Wenhu Chen. 2023. Large language models are few(1)-shot table reasoners. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1120–1130, Dubrovnik, Croatia. Association for Computational Linguistics.

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020. TabFact: A large-scale dataset for table-based fact verification. Preprint, arXiv:1909.02164.

Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. 2021. HiTab: A hierarchical table dataset for question answering and natural language generation. arXiv preprint arXiv:2108.06712.

Róbert Csordás, Christopher D Manning, and Christopher Potts. 2025. Do language models use their depth efficiently? arXiv preprint arXiv:2505.13898.
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, and 81 others. 2024. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434.

Naihao Deng and Rada Mihalcea. 2025. Rethinking table instruction tuning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21757–21780, Vienna, Austria. Association for Computational Linguistics.

Naihao Deng, Zhenjie Sun, Ruiqi He, Aman Sikka, Yulong Chen, Lin Ma, Yue Zhang, and Rada Mihalcea. 2024. Tables as texts or images: Evaluating the table reasoning ability of LLMs and MLLMs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 407–426, Bangkok, Thailand. Association for Computational Linguistics.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495.

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. 2024. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781.

Yi Hu, Cai Zhou, and Muhan Zhang. 2025. What affects the effective depth of large language models? arXiv preprint arXiv:2512.14064.

Deyi Ji, Lanyun Zhu, Siqi Gao, Peng Xu, Hongtao Lu, Jieping Ye, and Feng Zhao. 2024. Tree-of-table: Unleashing the power of LLMs for enhanced large-scale table understanding. Preprint, arXiv:2411.08516.
Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9237–9251, Singapore. Association for Computational Linguistics.

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. 2025a. See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321.

Xiaoqiang Kang, Shengen Wu, Zimu Wang, Yilin Liu, Xiaobo Jin, Kaizhu Huang, Wei Wang, Yutao Yue, Xiaowei Huang, and Qiufeng Wang. 2025b. Can GRPO boost complex multimodal table understanding? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12631–12644, Suzhou, China. Association for Computational Linguistics.

Yannis Katsis, Saneem Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim, Michael Glass, Alfio Gliozzo, Feifei Pan, Jaydeep Sen, Karthik Sankaranarayanan, and Soumen Chakrabarti. 2021. AIT-QA: Question answering dataset over complex tables in the airline industry. Preprint, arXiv:2106.12944.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7057–7075, Online. Association for Computational Linguistics.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2023. Analyzing feed-forward blocks in transformers through the lens of attention maps. arXiv preprint arXiv:2302.00456.

Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and Zhaoxiang Zhang. 2023. SheetCopilot: Bringing software productivity to the next level through large language models. In Thirty-seventh Conference on Neural Information Processing Systems.
Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Bowen Qin, Yurong Wu, Xiaodong Li, Chenhao Ma, Jian-Guang Lou, and Reynold Cheng. 2025. Are large language models ready for multi-turn tabular data analysis? In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 34795–34835. PMLR.

Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, and Jie Fu. 2025. A closer look into mixture-of-experts in large language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4427–4447.

Weizheng Lu, Jing Zhang, Ju Fan, Zihao Fu, Yueguo Chen, and Xiaoyong Du. 2025. Large language model for table processing: a survey. Frontiers of Computer Science, 19(2).

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.

Microsoft. 2025. Copilot in Excel. Accessed: 2025-12-29.

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, and 5 others. 2024. OLMoE: Open mixture-of-experts language models. CoRR, abs/2409.02060.

Nostalgebraist. 2020. Interpreting GPT: the logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.

OpenAI. 2024. Improvements to data analysis in ChatGPT. Accessed: 2025-12-29.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing, China. Association for Computational Linguistics.
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. 2025. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. Preprint, arXiv:2505.06708.

Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Attention sinks and compression valleys in LLMs are two sides of the same coin. arXiv preprint arXiv:2510.06477.

Alexey Shigarov. 2023. Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery, 13(1):e1482.

Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, and Weiping Wang. 2023. An empirical study of instruction-tuning large language models in Chinese. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4086–4107, Singapore. Association for Computational Linguistics.

Ananya Singha, José Cambronero, Sumit Gulwani, Vu Le, and Chris Parnin. 2023. Tabular representation, noisy operators, and impacts on table structure understanding tasks in LLMs. Preprint, arXiv:2310.10358.

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013.

Aofeng Su, Aowen Wang, Chao Ye, Chen Zhou, Ga Zhang, Gang Chen, Guangcheng Zhu, Haobo Wang, Haokai Xu, Hao Chen, Haoze Li, Haoxuan Lan, Jiaming Tian, Jing Yuan, Junbo Zhao, Junlin Zhou, Kaizhe Shou, Liangyu Zha, Lin Long, and 14 others. 2024. TableGPT2: A large multimodal model with tabular data integration. Preprint, arXiv:2411.02059.

Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, and Kehong Yuan. 2025. Unveiling super experts in mixture-of-experts large language models. arXiv preprint arXiv:2507.23279.
Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table meets LLM: Can large language models understand structured table data? A benchmark and empirical study. Preprint, arXiv:2305.13062.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, and Yu Wu. 2024a. Let the expert stick to his last: Expert-specialized fine-tuning for sparse architectural large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 784–801.

Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. 2024b. Chain-of-table: Evolving tables in the reasoning chain for table understanding. In ICLR.

Han Weng, Zhou Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen, and Wentao Zhang. 2025. UniDataBench: Evaluating data analytics agents across structured and unstructured data. Preprint, arXiv:2511.01625.

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2024. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574.

Zhenhe Wu, Jian Yang, Jiaheng Liu, Xianjie Wu, Changzai Pan, Jie Zhang, Yu Zhao, Shuangyong Song, Yongxiang Li, and Zhoujun Li. 2025. Table-R1: Region-based reinforcement learning for table understanding. Preprint, arXiv:2505.12415.

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Shang Yang, Haotian Tang, Yao Fu, Song Han, and 1 others. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. In The Thirteenth International Conference on Learning Representations.

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025a. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Zheyuan Yang, Lyuhao Chen, Arman Cohan, and Yilun Zhao. 2025b. Table-R1: Inference-time scaling for table reasoning tasks. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20616–20635, Suzhou, China. Association for Computational Linguistics.

Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. 2023. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, page 174–184, New York, NY, USA. Association for Computing Machinery.

Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun. 2024. TableLlama: Towards open large generalist models for tables. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6024–6044, Mexico City, Mexico. Association for Computational Linguistics.

Xiaokang Zhang, Sijia Luo, Bohan Zhang, Zeyao Ma, Jing Zhang, Yang Li, Guanlin Li, Zijun Yao, Kangli Xu, Jinchang Zhou, Daniel Zhang-Li, Jifan Yu, Shu Zhao, Juanzi Li, and Jie Tang. 2025. TableLLM: Enabling tabular data manipulation by LLMs in real office usage scenarios. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10315–10344, Vienna, Austria.
Association for Computational Linguistics.

Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, and Weiping Wang. 2024. Multimodal table understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9102–9124, Bangkok, Thailand. Association for Computational Linguistics.

Wei Zhou, Mohsen Mesgar, Heike Adel, and Annemarie Friedrich. 2025a. p2-tqa: A process-based preference learning framework for self-improving table question answering models. Preprint, arXiv:2505.17565.

Wei Zhou, Mohsen Mesgar, Heike Adel, and Annemarie Friedrich. 2025b. Texts or images? A fine-grained analysis on the effectiveness of input representations and models for table question answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2307–2318, Vienna, Austria. Association for Computational Linguistics.

A Background

A.1 Task Formalization

Given a table T and a natural language question Q, Table Question Answering (TableQA) requires the model G(·) to generate a final answer A based on the above information. Within the context of Large Language Models (LLMs), the input sequence x is composed of three segments serving distinct semantic roles:

- System Prompt Segment (S_sys): Comprises the system instructions and model-specific conversation templates (e.g., <|begin_of_text|> for Llama-3.1-8B-Instruct (Dubey et al., 2024)).
- Table Content Segment (S_tab): Contains the table data serialized in a specific format (e.g., Markdown or HTML).
- Question Description Segment (S_qst): Contains the specific user query Q.

Let I_sys, I_tab, and I_q denote the sets of indices corresponding to these segments in the input sequence; the total input index set is I_in = I_sys ∪ I_tab ∪ I_q. The set of indices for tokens generated by the model is denoted as I_gen.

A.2 Architecture of LLMs

The input sequence is processed by the LLM through L transformer blocks (Vaswani et al., 2017).
The update process of the residual stream at each layer l can be unified as follows:

$\hat{x}^{l} = \mathrm{MHA}^{l}(\mathrm{LN}(x^{l-1})) + x^{l-1}$  (1)

$x^{l} = \mathrm{Trans}^{l}(\mathrm{LN}(\hat{x}^{l})) + \hat{x}^{l}$  (2)

where $x^{l} \in \mathbb{R}^{n \times d}$ represents the hidden state at layer l, d is the embedding dimension, and n is the total sequence length. MHA denotes the Multi-Head Attention module, and Trans represents the transition layer, which is a Feed-Forward Network (FFN) in dense LLMs or a Mixture-of-Experts (MoE) layer in MoE LLMs.

In the l-th layer, the MHA mechanism allocates attention weights by computing the similarity between Queries (Q) and Keys (K):

$\alpha^{l,h}_{i,j} = \mathrm{softmax}\left(\frac{(x^{l-1}_{i} W^{l,h}_{Q})(x^{l-1}_{j} W^{l,h}_{K})^{\top}}{\sqrt{d_k}}\right)$  (3)

where $\alpha^{l,h}_{i,j}$ denotes the attention weight from token i to token j in the h-th attention head of the l-th layer. The final output of a single attention head is the aggregation of contributions from all preceding tokens:

$\mathrm{MHA}^{l,h}(x_i) = \sum_{j \le i} \alpha^{l,h}_{i,j} \, (x^{l-1}_{j} W^{l,h}_{OV})$  (4)

Here, $W^{l,h}_{OV} = W^{l,h}_{V} W^{l,h}_{O}$ represents the joint value-output projection matrix, which defines the specific path through which information is mapped from input tokens into the residual stream.

For MoE LLMs, the transition layer introduces a sparse activation mechanism where a routing function G selects the Top-k experts for each token. The output is scaled by the routing weights and aggregated:

$\mathrm{MoE}(x_i) = \sum_{e \in \mathrm{Top}\text{-}k(G_i)} G_{i,e} \cdot E_{e}(\mathrm{LN}(x_i))$  (5)

where $G_i = \mathrm{softmax}(x_i W_G)$ and $E_e$ denotes the e-th expert. Through this hierarchical process, the LLM transforms internal representations into the final response.

B Interpretability Analysis Methods

This section introduces our core methodology for revealing the internal mechanisms of LLMs during tabular reasoning, categorized into three dimensions: Inner Prompt, Inner Table, and Contribution.
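The sparse MoE transition of Eq. (5) above can be sketched in a few lines. This is a simplified illustration, not the released implementation: LayerNorm is omitted, and the function name and argument layout are our assumptions.

```python
import numpy as np

def moe_forward(x, W_G, experts, k=2):
    """Sparse MoE transition in the form of Eq. (5): softmax gate, Top-k experts.

    x: (d,) token hidden state; W_G: (d, n_experts) gating matrix;
    experts: list of callables E_e(x) -> (d,).
    """
    logits = x @ W_G
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                       # G_i = softmax(x_i W_G)
    top = np.argsort(gates)[-k:]               # indices of the Top-k experts
    # Sum the routing-weighted outputs of only the selected experts.
    return sum(gates[e] * experts[e](x) for e in top)

# Three toy experts; for this token the gate prefers experts 1 and 2.
experts = [lambda x: 1 * x, lambda x: 2 * x, lambda x: 3 * x]
W_G = np.array([[0.0, 1.0, 2.0]])              # (d=1, n_experts=3)
out = moe_forward(np.array([1.0]), W_G, experts, k=2)
```

Recording which experts enter the Top-k set for each table token is exactly the activation-frequency statistic used in Section 6.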
B.1 Inner Prompt: Segment Attention Ratio

To quantify the allocation of attention resources across different input segments during inference, we define the segment attention ratio D, which measures the degree of attention weight assigned to a specific segment S at layer l, averaged across all attention heads and the entire generation sequence I_gen:

$D^{l}(S) = \frac{1}{|I_{gen}| \cdot H} \sum_{i \in I_{gen}} \sum_{h=1}^{H} \sum_{j \in S} \hat{\alpha}^{l,h}_{i,j}$  (6)

where $\hat{\alpha}^{l,h}_{i,j} = \alpha^{l,h}_{i,j} \big/ \sum_{k \in I_{in}} \alpha^{l,h}_{i,k}$ represents the attention weight rescaled over the input segment I_in at generation step $i \in I_{gen}$. By observing the trend of $D^{l}(S)$ across layers, we can identify the dynamic transition of attention from the structured background (S_tab) to the task objective (S_qst).

B.2 Inner Table: Table Attention Entropy

To quantitatively describe the degree of focus on relevant cells within the table, we define the average table attention entropy at layer l as:

$H^{l}_{table} = \frac{1}{|I_{gen}|} \sum_{i \in I_{gen}} \Big( -\sum_{j \in I_{tab}} \bar{P}^{l}_{i}(j) \log \bar{P}^{l}_{i}(j) \Big)$  (7)

where the inner-table attention distribution $\bar{P}^{l}_{i}(j)$ is defined as:

$\bar{P}^{l}_{i}(j) = \frac{\sum_{h=1}^{H} \alpha^{l,h}_{i,j}}{\sum_{k \in I_{tab}} \sum_{h=1}^{H} \alpha^{l,h}_{i,k}}, \quad \forall j \in I_{tab}$  (8)

A higher entropy indicates a uniform distribution of attention across the table, whereas a lower entropy signifies that attention is concentrated in a few specific cells, implying that the layer may have identified the task-relevant information.

B.3 Value-Weighted Contribution

Following the perspective of previous research (Kobayashi et al., 2020; Gu et al., 2024; Kang et al., 2025a) that attention weights alone do not quantify the actual magnitude of change in the residual stream, we introduce the Value-Weighted Contribution analysis. To capture the model's behavior throughout the generation process, we compute the averaged contribution over I_gen:

$C^{l}(S) = \frac{1}{|I_{gen}| \cdot H} \sum_{i \in I_{gen}} \sum_{j \in S} \sum_{h=1}^{H} \left\| \alpha^{l,h}_{i,j} \, x^{l-1}_{j} W^{l,h}_{OV} \right\|_{2}$  (9)

where $S \in \{I_{sys}, I_{tab}, I_{q}\}$.
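The first two metrics are straightforward to compute from a layer's attention tensor. A minimal sketch of Eq. (6) and Eqs. (7)-(8); the function names and the (heads, seq, seq) layout are our assumptions, not the paper's released code:

```python
import numpy as np

def segment_attn_ratio(attn, seg_idx, gen_idx, in_idx):
    """Segment attention ratio D^l(S) of Eq. (6) for one layer.

    attn: (heads, seq, seq) attention weights; seg_idx: positions of
    segment S; gen_idx: generated-token positions; in_idx: all input
    positions, used to rescale each row as for alpha-hat.
    """
    rows = attn[:, gen_idx, :]
    rescaled = rows / rows[..., in_idx].sum(axis=-1, keepdims=True)
    return float(rescaled[..., seg_idx].sum(axis=-1).mean())

def table_attention_entropy(attn, table_idx, gen_idx):
    """Average inner-table attention entropy of Eqs. (7)-(8)."""
    entropies = []
    for i in gen_idx:
        p = attn[:, i, table_idx].sum(axis=0)  # sum heads: Eq. (8) numerator
        p = p / p.sum()                        # renormalize within the table
        entropies.append(-np.sum(p * np.log(p + 1e-12)))
    return float(np.mean(entropies))

# Uniform attention: position 7 generates, positions 0-5 are the input,
# the first two (ratio) or four (entropy) positions form the table.
attn = np.full((1, 8, 8), 1 / 8)
ratio = segment_attn_ratio(attn, [0, 1], [7], list(range(6)))   # 2/6
h_table = table_attention_entropy(attn, [0, 1, 2, 3], [7])      # ln 4
```

Averaging these per-layer values over examples yields the curves reported in the attention-dynamics analysis.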
By integrating the hidden states $x_j$ with the projection matrix $W_{OV}$, this formula faithfully reconstructs the intensity of the information volume injected by a specific segment S at layer l to drive the following reasoning.

C Experimental Details

C.1 Prompt Templates

We provide the complete prompt templates used in our experiments. The input structure follows the standard practice of prompting LLMs for table tasks, consisting of three segments: (1) a prompt template, (2) a table content segment, and (3) a user question segment.

Overall Prompt Template
System Prompt: You are a helpful assistant.
User Prompt:
TABLE: table_content
INPUT: question_content

Prompt Template for Table Question Answering (WTQ, HiTab, AITQA)
TABLE: table_content (markdown / html)
INPUT: Given the table titled 'table_title', answer the following question based on the given table. The final answer should be concise and use the following format:
```json
"answer": ["answer1", "answer2", ...]
```
Question: question_content

Prompt Template for Table Fact Verification (TabFact)
TABLE: table_content (markdown / html)
INPUT: Given the table titled 'table_title', determine whether the following statement is entailed or refuted by the given table (Output 1 for entailed and 0 for refuted):
Statement: statement_content

C.2 Table Serialization Formats

Unless otherwise specified, we use Markdown as the default table serialization format. Figure 7 illustrates the two serialization formats (Markdown and HTML) used in our experiments.

Markdown Format
| Name  | Age | City     |
|-------|-----|----------|
| Alice | 25  | Beijing  |
| Bob   | 30  | Shanghai |

HTML Format
<table>
<tr><th>Name</th><th>Age</th><th>City</th></tr>
<tr><td>Alice</td><td>25</td><td>Beijing</td></tr>
<tr><td>Bob</td><td>30</td><td>Shanghai</td></tr>
</table>

Figure 7: Examples of table serialization formats.
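The two serializations in Figure 7 can be produced with a few lines of code. A small sketch (helper names are ours) that also makes the verbosity difference between the formats easy to inspect:

```python
def to_markdown(header, rows):
    """Serialize a table in the Markdown style of Figure 7."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(map(str, r)) + " |" for r in rows]
    return "\n".join(lines)

def to_html(header, rows):
    """Serialize the same table in the HTML style of Figure 7."""
    head = "<tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>"
    body = "".join("<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>"
                   for r in rows)
    return f"<table>{head}{body}</table>"

header = ["Name", "Age", "City"]
rows = [["Alice", 25, "Beijing"], ["Bob", 30, "Shanghai"]]
md = to_markdown(header, rows)
html = to_html(header, rows)
```

Comparing `len(md)` with `len(html)` on the same table shows the extra structural tokens (`<tr>`, `<td>`, ...) that HTML spends per cell, which is the verbosity the attention-entropy comparison in Section 7.1 points to.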
C.3 Analyzed Models To provide a comprehensive analysis across differ- ent model architectures and training paradigms, we categorize the investigated LLMs into three groups: (1) Instruct LLMs.We analyze general-purpose instruction-tuned models that represent the main- stream LLM paradigm, including the Llama se- ries (Dubey et al., 2024) (e.g., Llama-3.1-8B- Instruct) and the Qwen series (Yang et al., 2024) (e.g., Qwen-2.5-7B-Instruct). (2) Table-Specific Fine-tuned LLMs. To un- derstand how specialized training affects internal mechanisms, we include models specifically op- timized for tabular tasks. We further distinguish between SFT-only models trained through super- vised fine-tuning (e.g., the TAMA series (Deng and Mihalcea, 2025)) and RL-only models optimized via reinforcement learning (e.g., the Table-R1-Zero series (Yang et al., 2025b)). (3) Mixture-of-Experts (MoE) LLMs.Beyond dense architectures, we extend our analysis to MoE models to examine whether sparse expert activa- tion leads to different table understanding behav- iors. Representative models include OLMoE-1B- 7B (Muennighoff et al., 2024), DeepSeek-V2-Lite- Chat (DeepSeek-AI et al., 2024), and Qwen3-30B- A3B (Yang et al., 2025a). D Model Analysis Details D.1 Benchmark Performance Table 4 reports the accuracy of all analyzed LLMs on our four evaluation benchmarks. Several obser- vations emerge: (1) larger models generally achieve better performance; (2) table-specific fine-tuned models (TAMA, TableGPT2, Table-R1) show no- table improvements on hierarchical table datasets (HiTab, AITQA); and (3) RL-tuned models (Table- R1 series) demonstrate particularly strong perfor- mance on complex table structures. Table 5 reports the performance of representa- tive models under different input format and infer- ence strategy combinations, with 95% bootstrap confidence intervals (1,000 iterations,N=2,000 per condition). All CoT vs. 
Direct differences are statistically significant (non-overlapping CIs), confirming the reliability of the prompting strategy conclusions. Markdown vs. HTML differences fall within the margin of uncertainty for most models, consistent with the observation that format-specific differences in early layers gradually converge in deeper layers, resulting in minimal overall performance gaps.

Table 4: LLMs' performance (accuracy %) on evaluation data of four benchmarks. Best results are bolded.

| Model | WTQ | TabFact | HiTab | AITQA |
| --- | --- | --- | --- | --- |
| *Instruct LLMs* | | | | |
| LLaMA3.2-1B-Instruct | 6.00 | 5.40 | 6.40 | 9.80 |
| LLaMA3.2-3B-Instruct | 33.00 | 58.00 | 30.80 | 52.00 |
| LLaMA3.1-8B-Instruct | 50.20 | 70.80 | 49.00 | 57.80 |
| Qwen2.5-3B-Instruct | 52.40 | 78.00 | 49.80 | 55.60 |
| Qwen3-4B-Instruct-2507 | 52.00 | 75.20 | 49.60 | 60.00 |
| Qwen2.5-7B-Instruct | 51.00 | 73.00 | 53.40 | 71.20 |
| Qwen2.5-14B-Instruct | 56.40 | **88.60** | 55.80 | 74.40 |
| Qwen2.5-32B-Instruct | **64.60** | 85.80 | 68.80 | 83.60 |
| *Table-Specific Fine-tuned LLMs* | | | | |
| TAMA-Qwen2.5 | 52.40 | 74.80 | 56.80 | 71.60 |
| TAMA | 50.60 | 72.60 | 66.80 | 82.00 |
| TableGPT2 | 56.80 | 74.80 | 59.80 | 84.20 |
| Table-R1-7B-Zero | 53.80 | 76.20 | 63.80 | **85.20** |
| Table-R1-8B-Zero | 60.40 | 70.40 | **74.00** | 78.60 |
| *MoE LLMs* | | | | |
| OLMoE-1B-7B-Instruct | 6.82 | 2.20 | 5.01 | 14.60 |
| DeepSeek-V2-Lite-Chat | 28.60 | 63.40 | 30.40 | 53.00 |
| Qwen3-30B-A3B-Instruct-2507 | 56.40 | 78.20 | 57.40 | 73.20 |

Table 5: Performance (%) with 95% bootstrap confidence intervals under different input formats and inference strategies. † marks statistically significant differences between CoT and Direct (non-overlapping CIs).

| Model | Markdown Direct | Markdown CoT | HTML Direct | HTML CoT |
| --- | --- | --- | --- | --- |
| *Instruct LLMs* | | | | |
| Qwen2.5-7B | 62.2±2.1 | 67.2±2.0† | 61.4±2.1 | 67.8±2.0† |
| Llama-3.1-8B | 57.0±2.2 | 62.4±2.1† | 58.0±2.1 | 58.4±2.1† |
| *Table-Specific Fine-tuned LLMs* | | | | |
| Table-R1-7B | 69.8±2.0 | 82.3±1.7† | 70.3±1.9 | 81.6±1.7† |
| Table-R1-8B | 70.8±2.0 | 82.2±1.7† | 69.9±2.1 | 80.6±1.8† |
| *MoE LLMs* | | | | |
| Qwen3-30B-A3B | 66.3±2.1 | 70.2±2.0† | 68.3±2.0 | 72.1±1.9† |

D.2 Analysis of Effective Depth

We extend the effective depth analysis from Section 5 to all analyzed models.
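The effective-layer criterion used throughout this analysis, the first layer whose top-5 tokens overlap the final output's top-5 by more than 0.3, can be sketched as follows. This is a minimal illustration assuming per-layer top-5 token lists obtained from LogitLens readouts; all variable names are ours:

```python
def effective_layer(layer_top5, final_top5, threshold=0.3):
    """Return the index of the first layer whose top-5 tokens overlap
    the final output's top-5 by more than `threshold` (as a fraction)."""
    final = set(final_top5)
    for layer, top5 in enumerate(layer_top5):
        overlap = len(set(top5) & final) / len(final)
        if overlap > threshold:
            return layer
    return None  # prediction never stabilizes

# Toy example: intermediate predictions align with the final output at layer 2
layers = [
    ["the", "a", "of", "in", "to"],          # layer 0: generic tokens, overlap 0
    ["table", "a", "of", "in", "to"],        # layer 1: overlap 0
    ["China", "Japan", "table", "a", "of"],  # layer 2: overlap 2/5 = 0.4 > 0.3
]
final = ["China", "Japan", "Gold", "most", "nation"]
eff = effective_layer(layers, final)  # -> 2
eff_ratio = eff / len(layers)         # analogous to Eff. Layers / Total Layers
```

The Eff. Ratio column in Table 6 is this effective layer divided by the model's total layer count.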
Table 6 summarizes the layer effectiveness statistics, while Figure 27 provides detailed layer-wise prediction stability curves for each model. As described in the main text, we define the effective layer as the first layer where the top-5 token overlap between the intermediate prediction and the final output exceeds 0.3. The effective ratio (Eff. Ratio = Eff. Layers / Total Layers) indicates how deeply the model processes before reaching a stable prediction; a lower ratio suggests earlier stabilization.

Key observations from the extended analysis:

- Most models stabilize their predictions in the later 75-90% of their total layers, confirming that table tasks require deep processing.
- Table-specific fine-tuning (TAMA, Table-R1) does not significantly alter the effective depth compared to the corresponding base models, suggesting that layer-wise functionality is largely determined during pre-training.
- MoE models (OLMoE, DeepSeek-V2-Lite, Qwen3-30B-A3B) exhibit effective ratios similar to those of dense models of comparable capacity.

Table 6: Layer effectiveness analysis on table-related tasks using LogitLens. Eff. Layers identifies the first layer at which the predicted vocabulary distribution aligns with the final output (defined as the first layer where the top-5 token overlap exceeds 0.3). Eff. Ratio (Eff. Layers / Total Layers) reflects the stage at which the model stabilizes its prediction; a lower ratio indicates that the model reaches its final decision in shallower layers.

| Model | Total Layers | Eff. Layers | Eff. Ratio |
| --- | --- | --- | --- |
| *Instruct LLMs* | | | |
| Llama-3.2-1B-Instruct | 16 | 14 | 0.88 |
| Llama-3.2-3B-Instruct | 28 | 23 | 0.82 |
| Meta-Llama-3.1-8B-Instruct | 32 | 25 | 0.78 |
| Qwen2.5-3B-Instruct | 36 | 33 | 0.92 |
| Qwen2.5-7B-Instruct | 28 | 25 | 0.89 |
| Qwen2.5-14B-Instruct | 48 | 44 | 0.92 |
| Qwen2.5-32B-Instruct | 64 | 57 | 0.89 |
| Qwen3-4B-Instruct-2507 | 36 | 24 | 0.67 |
| *Table-Specific Fine-tuned LLMs* | | | |
| TableGPT2-7B | 28 | 25 | 0.89 |
| TAMA-QWen2.5 | 28 | 25 | 0.89 |
| Table-R1-Zero-7B | 28 | 25 | 0.89 |
| TAMA-1e-6 | 32 | 25 | 0.78 |
| Table-R1-Zero-8B | 32 | 24 | 0.75 |
| *MoE LLMs* | | | |
| OLMoE-1B-7B-0924-Instruct | 16 | 14 | 0.88 |
| DeepSeek-V2-Lite-Chat | 27 | 21 | 0.78 |
| Qwen3-30B-A3B-Instruct-2507 | 48 | 45 | 0.94 |

E More Discussions

E.1 Connection to Retrieval Heads

Recent work has identified retrieval heads, a sparse set of attention heads responsible for copying relevant information from context to output (Wu et al., 2024; Xiao et al.). Given that table QA inherently requires locating and extracting specific cell values, we hypothesize that table understanding may share similar mechanisms.

To investigate this, we compare the attention heads most critical for table tasks (measured by attention norm intensity) with the Top-32 retrieval heads identified by Wu et al. (2024), using our analysis tabular data. As shown in Figure 9, the retrieval heads (marked with red boxes) predominantly align with regions of high attention norm intensity (yellow areas) across all examined models. This overlap is particularly pronounced in the middle-to-late layers: for instance, in Qwen2.5-7B, retrieval heads cluster around layers 19-27; in LLaMA3.1-8B, they concentrate in layers 14-15 and 26-28; in TAMA and Table-R1-8B, similar patterns emerge in layers 13-17 and 21-28.

Notably, this alignment pattern persists across both vanilla instruction-tuned models (Qwen2.5, LLaMA3) and table-specific fine-tuned models (TAMA, Table-R1), suggesting that table-specific training does not fundamentally alter which heads are responsible for information retrieval; it may instead enhance the efficiency of these pre-existing circuits.

This finding suggests that LLMs leverage pre-existing retrieval mechanisms for table understanding rather than developing entirely separate circuits. It also has practical implications: KV cache compression techniques designed for retrieval heads (Xiao et al.) may be directly applicable to accelerating table reasoning tasks without significant performance degradation.
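The comparison in this subsection reduces to checking how many retrieval heads rank among the heads with the highest attention norm. A minimal sketch under assumed inputs (per-head scores keyed by (layer, head) pairs; all names are ours):

```python
def head_overlap(retrieval_heads, norm_scores, k=32):
    """Count how many retrieval heads fall in the top-k heads
    ranked by attention norm intensity."""
    top_k = sorted(norm_scores, key=norm_scores.get, reverse=True)[:k]
    return len(set(retrieval_heads) & set(top_k))

# Toy scores for a 2-layer, 4-head model: later layers score higher
norm_scores = {(l, h): l + 0.1 * h for l in range(2) for h in range(4)}
retrieval = [(1, 3), (1, 2), (0, 0)]
overlap = head_overlap(retrieval, norm_scores, k=4)  # -> 2
```

A high count relative to the size of the retrieval-head set indicates the kind of overlap visible in Figure 9.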
F Attention Dynamics for All Models

To validate the generalizability of our findings in Section 4, we present the complete attention dynamics analysis for all 16 LLMs examined in this study. Each figure follows the same layout as Figure 1: upper panels show case-level attention patterns at selected layers, while lower panels display aggregated metrics including the segment attention ratio, table attention entropy, and attention contribution across all layers.

Prompt:
TABLE:
| Rank | Nation | Gold | Silver | Bronze | Total |
| --- | --- | --- | --- | --- | --- |
| 1 | China | 13 | 9 | 13 | 35 |
| 2 | Japan | 7 | 10 | 7 | 24 |
| 3 | Uzbekistan | 1 | 2 | 3 | 6 |
| 4 | Kazakhstan | 2 | 2 | 0 | 4 |
| 5 | North Korea | 1 | 0 | 1 | 2 |
| 6 | South Korea | 0 | 0 | 2 | 2 |
| Total | Total | 24 | 23 | 26 | 73 |
INPUT: Given the table titled 'Figure skating at the Asian Winter Games', answer the following question based on the given table. The final answer should be concise and use the following format:
```json
"answer": ["answer1", "answer2", ...]
```
Question: which nation won the most gold medals?

Response:
Qwen2.5-7B-Instruct:
```json
"answer": ["China"]
```
Meta-LLaMA-3.1-8B-Instruct:
"answer": ["China"]

Figure 8: The qualitative case example for Q1.

Despite variations in model architecture (dense vs. MoE), scale (1B to 32B), and training paradigm (instruction tuning, SFT, RL), all models exhibit the consistent three-phase pattern: dispersed attention in early layers, concentrated focus on query-relevant cells in middle layers, and amplified contributions in late layers. Notably, table-specific fine-tuned models (TAMA, Table-R1, TableGPT2) show slightly higher attention ratios to table content compared to their base models, while MoE models (DeepSeek-V2-Lite, OLMoE, Qwen3-30B-A3B) display more gradual entropy transitions across layers.
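Two of the aggregated metrics in the lower panels, the segment attention ratio and the table attention entropy, can be computed directly from one generation step's attention weights over the prompt tokens. The following is a minimal sketch; `attn` and the index lists are illustrative inputs of our own naming:

```python
import math

def segment_attn_ratio(attn, segment_idx):
    """Fraction of the total attention mass falling on a token segment
    (e.g., the table-content tokens)."""
    total = sum(attn)
    return sum(attn[i] for i in segment_idx) / total

def table_attn_entropy(attn, table_idx):
    """Shannon entropy of the attention distribution restricted to
    (and renormalized over) the table tokens; lower = more concentrated."""
    mass = [attn[i] for i in table_idx]
    z = sum(mass)
    probs = [m / z for m in mass]
    return -sum(p * math.log(p) for p in probs if p > 0)

attn = [0.05, 0.05, 0.4, 0.3, 0.1, 0.1]  # toy attention over 6 prompt tokens
table_idx = [2, 3, 4, 5]                 # suppose tokens 2-5 are table content
ratio = segment_attn_ratio(attn, table_idx)    # -> 0.9
entropy = table_attn_entropy(attn, table_idx)
```

In the figures, the three-phase pattern shows up as a high-entropy (broad-scan) early phase, an entropy drop as middle layers localize relevant cells, and a rising attention-contribution norm in late layers.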
[Figures 9-24 are image panels; only their captions are recoverable from the extracted text. Each attention-dynamics figure (10-24) shares the layout described above, using the example question "which nation won the most gold medals?" (answer: China).]

Figure 9: Top-32 retrieval heads' overlap with attention norm heatmaps across LLMs. Background color indicates attention norm intensity (yellow = higher); red boxes mark the Top-32 retrieval heads identified by prior work. The substantial overlap, especially in middle-to-late layers, suggests that table understanding reuses pre-existing retrieval circuits.

Figure 10: Attention dynamics of Qwen3-30B-A3B-Instruct-2507 (MoE).
Figure 11: Attention dynamics of TAMA (dense).
Figure 12: Attention dynamics of Table-R1-Zero-8B (dense).
Figure 13: Attention dynamics of Table-R1-Zero-7B (dense).
Figure 14: Attention dynamics of TAMA-Qwen2.5 (dense).
Figure 15: Attention dynamics of Qwen3-4B-Instruct (dense).
Figure 16: Attention dynamics of Qwen2.5-3B-Instruct (dense).
Figure 17: Attention dynamics of Qwen2.5-14B-Instruct (dense).
Figure 18: Attention dynamics of Qwen2.5-32B-Instruct (dense).
Figure 19: Attention dynamics of TableGPT2-7B (dense).
Figure 20: Attention dynamics of LLaMA3.2-3B-Instruct (dense).
Figure 21: Attention dynamics of LLaMA3.2-1B-Instruct (dense).
Figure 22: Attention dynamics of DeepSeek-V2-Lite-Chat (MoE).
Figure 23: Attention dynamics of OLMoE-1B-7B-Instruct (MoE).
Figure 24: Attention dynamics of Qwen's 1B gated attention baseline (dense).
M: Gated_Attn_1B_Elementwise (Dense) A: China Case: Attn Dynamics Overall: Layer-wise Attn Analysis StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L0 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L0 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L6 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L6 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L11 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L11 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L18 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L18 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L21 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L21 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L27 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L27 Table Cells Attn Distri 0510152025 Layer Index 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio Layer-wise Segment Attn Ratio 0510152025 Layer Index 3.0 3.5 4.0 4.5 Entropy Table Attn Entropy 0510152025 Layer Index 0 10 20 30 40 L2 Norm Attn Contribution System InstructionTable ContentUser Question Figure 25: Attention dynamics of Qwen’s 1B model with element-wise gated attention. Q: which nation won the most gold medals? 
M: Gated_Attn_1B_Headwise (Dense) A: China Case: Attn Dynamics Overall: Layer-wise Attn Analysis StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L0 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L0 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L6 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L6 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L13 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L13 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L18 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L18 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L21 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L21 Table Cells Attn Distri StartEnd Gen Step 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio L27 Attn Ratio Dyna RankNatioGoldSilveBronzTotal 1China1391335 2Japan710724 3Uzbek1236 4Kazak2204 5North1012 6South0022 TotalTotal24232673 L27 Table Cells Attn Distri 0510152025 Layer Index 0.0 0.2 0.4 0.6 0.8 1.0 Attn Ratio Layer-wise Segment Attn Ratio 0510152025 Layer Index 3 4 5 Entropy Table Attn Entropy 0510152025 Layer Index 0 5 10 15 L2 Norm Attn Contribution System InstructionTable ContentUser Question Figure 26: Attention dynamics of Qwen’s 1B model with Headwise Gated Attention. 
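The layer-wise segment attention ratio plotted in the attention-dynamics figures can be reproduced roughly as follows. This is a minimal sketch, not the authors' released pipeline: it assumes attention weights for one layer are available as an array of shape (num_heads, target_len, source_len) (e.g. from a Hugging Face model called with `output_attentions=True`), and that the prompt has been pre-split into token spans for the system instruction, table content, and user question. The function name `segment_attn_ratio` and the span dictionary are illustrative.

```python
import numpy as np

def segment_attn_ratio(attn, segments):
    """Fraction of attention mass each input segment receives in one layer.

    attn: array (num_heads, target_len, source_len) of attention weights,
          where each (head, target) row sums to 1 over source positions.
    segments: dict mapping a segment name to a (start, end) token span
              over the source sequence, e.g. {"table": (3, 8)}.
    Returns a dict of per-segment attention ratios, averaged over heads
    and target positions and normalized to sum to 1 over all tokens.
    """
    # Average over heads and query positions -> mass per source token.
    mass = attn.mean(axis=(0, 1))          # shape: (source_len,)
    total = mass.sum()
    return {name: float(mass[s:e].sum() / total)
            for name, (s, e) in segments.items()}

# Toy example: uniform attention over 10 source tokens, 2 heads, 4 queries.
attn = np.full((2, 4, 10), 0.1)
spans = {"system": (0, 3), "table": (3, 8), "question": (8, 10)}
ratios = segment_attn_ratio(attn, spans)
# Under uniform attention each ratio equals the segment's share of tokens:
# system 0.3, table 0.5, question 0.2.
```

With real model attentions, plotting these ratios per layer yields curves like the "Layer-wise Segment Attn Ratio" panels: the table-content ratio typically dominates in middle layers if the model is localizing relevant cells.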
[Figure 27: Analysis results of LLMs' effective layers. One panel per model, for all 16 models studied (Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Meta-Llama-3.1-8B-Instruct, TAMA, TAMA-Qwen2.5, Table-R1-Zero-7B, Table-R1-Zero-8B, TableGPT2-7B, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct, Qwen2.5-32B-Instruct, Qwen3-4B-Instruct, Qwen3-30B-A3B-Instruct, OLMoE-1B-7B-Instruct, DeepSeek-V2-Lite-Chat), plotting KL divergence and overlap (0.0-1.0) against layer index.]
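The per-layer KL divergence in Figure 27 measures how far each intermediate layer's prediction is from the model's final output; where it drops toward zero marks the effective layer depth. A hedged sketch of one common way to compute such a curve (a logit-lens-style proxy: project each layer's hidden state through the output head, then compare distributions; this is an assumption, not the authors' exact procedure, and `layerwise_kl` is a hypothetical helper operating on precomputed per-layer logits):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layerwise_kl(layer_logits, final_logits):
    """KL(final || layer) for each intermediate layer's next-token logits.

    layer_logits: iterable of 1-D logit vectors, one per layer (e.g. each
                  layer's hidden state projected through the unembedding).
    final_logits: 1-D logit vector from the model's last layer.
    Returns a list of KL divergences; values near 0 indicate the layer's
    prediction has already converged to the final output.
    """
    p = softmax(final_logits)
    kls = []
    for logits in layer_logits:
        q = softmax(logits)
        kls.append(float(np.sum(p * (np.log(p) - np.log(q)))))
    return kls

# Toy example: the last entry matches the final logits, so its KL is ~0.
final = np.array([1.0, 2.0, 3.0])
curve = layerwise_kl([np.array([3.0, 2.0, 1.0]), final], final)
# curve[0] is large (reversed preference); curve[1] is ~0 (converged).
```

Plotting such a curve against layer index, alongside the top-k token overlap with the final prediction, reproduces the shape of the Figure 27 panels: tabular tasks are reported to need deeper layers than math reasoning before the curve flattens.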