Paper deep dive
Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference
Antônio Junior Alves Caiado, Michael Hahsler
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/22/2026, 5:58:16 AM
Summary
This paper presents a comprehensive benchmark of Monte Carlo (MC) Dropout robustness across 19 transformer models, introducing a cognitive decomposition framework to separate memory (factual recall) and reasoning (commonsense inference) performance. The study reveals that dropout robustness is architecture-dependent and uncorrelated with model scale, with 53% of models showing significant accuracy degradation under MC Dropout. Notably, memory tasks exhibit high sensitivity to stochasticity, while reasoning tasks remain relatively stable, suggesting that memory-intensive representations are more easily disrupted by inference-time dropout.
Entities (5)
Relation Signals (3)
MC Dropout → impacts → Memory Accuracy
confidence 95% · High dropout reduces memory accuracy by 27 percentage points while reasoning degrades only 1 point
Dropout Robustness → is dependent on → Architecture
confidence 95% · revealing dropout robustness is architecture-dependent and uncorrelated with scale
Transformer Models → exhibit → Cognitive Specialization
confidence 90% · 84% of models demonstrate memory-biased performance
Cypher Suggestions (2)
Find models that show high sensitivity to MC Dropout · confidence 85% · unvalidated
MATCH (m:Model)-[:HAS_PERFORMANCE]->(p:Performance) WHERE p.degradation > 0.1 RETURN m.name, p.degradation
Map the relationship between model architecture and cognitive bias · confidence 80% · unvalidated
MATCH (a:Architecture)<-[:HAS_ARCHITECTURE]-(m:Model)-[:EXHIBITS]->(b:Bias) RETURN a.name, b.type, count(m)
Abstract
Transformer-based language models are widely deployed for reasoning, yet their behavior under inference-time stochasticity remains underexplored. While dropout is common during training, its inference-time effects via Monte Carlo sampling lack systematic evaluation across architectures, limiting understanding of model reliability in uncertainty-aware applications. This work analyzes dropout-induced variability across 19 transformer models using MC Dropout with 100 stochastic forward passes per sample. Dropout robustness is defined as maintaining high accuracy and stable predictions under stochastic inference, measured by standard deviation of per-run accuracies. A cognitive decomposition framework disentangles performance into memory and reasoning components. Experiments span five dropout configurations yielding 95 unique evaluations on 1,000 samples. Results reveal substantial architectural variation. Smaller models demonstrate perfect prediction stability while medium-sized models exhibit notable volatility. Mid-sized models achieve the best overall performance; larger models excel at memory tasks. Critically, 53% of models suffer severe accuracy degradation under baseline MC Dropout, with task-specialized models losing up to 24 percentage points, indicating unsuitability for uncertainty quantification in these architectures. Asymmetric effects emerge: high dropout reduces memory accuracy by 27 percentage points while reasoning degrades only 1 point, suggesting memory tasks rely on stable representations that dropout disrupts. 84% of models demonstrate memory-biased performance. This provides the first comprehensive MC Dropout benchmark for transformers, revealing dropout robustness is architecture-dependent and uncorrelated with scale. The cognitive profiling framework offers actionable guidance for model selection in uncertainty-aware applications.
Tags
Links
- Source: https://arxiv.org/abs/2603.17811v1
- Canonical: https://arxiv.org/abs/2603.17811v1
Full Text
45,364 characters extracted from source content.
Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference

Antônio Junior Alves Caiado and Michael Hahsler
Lyle School of Engineering, Southern Methodist University, Dallas, TX, USA
acaiado@smu.edu, mhahsler@lyle.smu.edu (corresponding author: acaiado@smu.edu)

Abstract. Transformer-based language models are widely deployed for reasoning tasks, yet their behavior under inference-time stochasticity remains underexplored. While dropout is commonly used during training, its effects during inference through Monte Carlo sampling have not been systematically evaluated across diverse architectures, limiting understanding of model reliability in uncertainty-aware applications. This work conducts a systematic analysis of dropout-induced variability across 19 transformer models using MC Dropout with 100 stochastic forward passes per sample. Dropout robustness is defined as a model’s ability to maintain both high accuracy under stochastic inference and stable predictions across dropout samples, measured by the standard deviation of per-run accuracies. A cognitive decomposition framework disentangles performance into memory and reasoning components. Experiments span five dropout configurations yielding 95 unique evaluations on 1,000 test samples. Results reveal substantial architectural variation in dropout robustness. Smaller models demonstrate perfect prediction stability while medium-sized models exhibit notable volatility. Mid-sized models achieve the best overall performance, while larger models excel at memory tasks. Critically, 53 percent of models suffer severe accuracy degradation under baseline MC Dropout, with task-specialized models losing up to 24 percentage points, indicating MC Dropout unsuitability for uncertainty quantification in these architectures.
Asymmetric dropout effects emerge: high dropout reduces memory accuracy by 27 percentage points while reasoning accuracy degrades only 1 point, suggesting memory tasks rely on stable representations that dropout disrupts. Systematic cognitive specialization emerges, with 84 percent of models demonstrating memory-biased performance. This work provides the first comprehensive MC Dropout benchmark for transformers, revealing that dropout robustness is architecture-dependent and uncorrelated with scale. The cognitive profiling framework offers actionable guidance for model selection in uncertainty-aware applications.

Keywords: Monte Carlo Dropout, Transformer Models, Uncertainty Quantification, Model Robustness, Cognitive Profiling, Stochastic Inference

arXiv:2603.17811v1 [cs.LG] 18 Mar 2026

1 Introduction

Transformer-based language models have achieved strong performance across natural language processing tasks Vaswani et al. (2017). Yet standard evaluations typically report accuracy under deterministic inference (with dropout disabled), providing limited insight into prediction stability under inference-time perturbations. For high-stakes or uncertainty-aware deployment, it is important to understand how model predictions change when stochasticity is introduced at test time, rather than only measuring point estimates of accuracy.

Dropout was introduced as a training-time regularizer Srivastava et al. (2014), and Gal and Ghahramani (2016) showed that enabling dropout at inference time—Monte Carlo (MC) Dropout—can approximate Bayesian inference by using repeated stochastic forward passes to estimate uncertainty. In transformer architectures, dropout commonly affects both attention and feed-forward sublayers, suggesting that inference-time stochasticity may influence models differently depending on architecture and task demands. Despite growing interest in transformer reliability, key limitations persist in existing work.
Many uncertainty estimation studies focus on narrow model families Desai and Durrett (2020), and benchmark suites such as GLUE and SuperGLUE aggregate heterogeneous tasks Wang et al. (2018, 2019), obscuring whether robustness differs between factual recall and reasoning. In addition, while scaling laws suggest larger models may be more robust Kaplan et al. (2020), robustness under stochastic inference has not been systematically tested across diverse transformer architectures.

We address these gaps with a large-scale evaluation of MC Dropout robustness across 19 transformer models spanning encoder-only (BERT, RoBERTa, DeBERTa, ELECTRA) and decoder-only (GPT-2, GPT-Neo) families. Each model is evaluated under five dropout configurations that vary attention and feed-forward dropout rates, including deterministic inference and high-dropout settings. To isolate cognitive demands, we introduce a dual-task evaluation that separates memory performance (SQuAD v1.1; Rajpurkar et al. (2016)) from reasoning performance (HellaSwag; Zellers et al. (2019)). This design enables us to answer three research questions: (1) How does dropout robustness vary across transformer architectures? (2) Do models exhibit systematic specialization in memory versus reasoning? (3) Does model scale predict stability under stochastic inference?

Our results show that inference-time dropout does not uniformly improve performance, and its impact depends on both architecture and task type. We observe consistent differences between memory and reasoning behavior under high stochasticity, and we find that overall robustness is not explained by scale alone. Together, these findings motivate a more nuanced view of transformer reliability under stochastic inference and provide practical guidance for model selection when uncertainty estimation is required.

Contributions.
We (i) provide a broad MC Dropout benchmark across diverse transformer families and dropout configurations, (ii) propose a cognitive decomposition framework separating memory and reasoning evaluation, and (iii) characterize architecture- and task-dependent robustness patterns that are not captured by aggregate benchmark scores.

2 Related Work

2.1 Dropout and Regularization

Dropout was introduced to reduce overfitting by randomly deactivating neurons during training (Srivastava et al., 2014). Variants extend the idea to connections (DropConnect (Wan et al., 2013)), recurrent states (Zoneout (Krueger et al., 2017)), and structured regions (DropBlock (Ghiasi et al., 2018)). While these methods target training-time generalization, our focus is inference-time behavior: how models respond when dropout is re-enabled as a stochastic perturbation via Monte Carlo Dropout.

2.2 Monte Carlo Dropout and Uncertainty Estimation

MC Dropout applies dropout at inference time and uses repeated stochastic forward passes to approximate Bayesian inference and estimate uncertainty (Gal and Ghahramani, 2016; Kendall and Gal, 2017). It has been applied across vision and NLP tasks, alongside alternatives such as deep ensembles and calibration methods (Lakshminarayanan et al., 2017; Guo et al., 2017). Ovadia et al. (2019) evaluated uncertainty methods under dataset shift and found no single approach dominates. However, most prior work evaluates limited model families and does not isolate where dropout is applied (e.g., attention vs. feed-forward) as a controlled factor. We address this with 19 models and component-specific dropout variations.

2.3 Transformer Architecture and Mechanistic Interpretability

Transformers (Vaswani et al., 2017) underpin widely used models (BERT, GPT, RoBERTa, DeBERTa) (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019; He et al., 2021).
Mechanistic studies suggest component specialization: circuit-level analysis (Elhage et al., 2021), induction heads (Olsson et al., 2022), feed-forward layers as memory-like key–value stores (Geva et al., 2021), and attention-head specialization. Motivated by this, we test whether attention and feed-forward sublayers exhibit different robustness under inference-time stochasticity by varying dropout in each component independently.

2.4 Model Robustness and Reliability

Reliability beyond accuracy includes adversarial and distribution-shift robustness and confidence calibration. Prior work shows transformer vulnerability to small input perturbations (Jin et al., 2020) and reports robustness gains from pre-training under natural shifts (Hendrycks et al., 2020). Calibration studies find that higher accuracy does not guarantee well-calibrated confidence (Desai and Durrett, 2020; Shelmanov et al., 2021). Scaling laws suggest larger models should generalize better (Kaplan et al., 2020; Hoffmann et al., 2022), but whether scale predicts robustness under stochastic inference remains unclear. We test this directly and observe that architecture is a stronger predictor than parameter count.

2.5 Cognitive Task Decomposition in Language Models

Task decomposition helps separate factual recall from reasoning. SQuAD emphasizes extracting factual answers from context (Rajpurkar et al., 2016), while HellaSwag targets commonsense continuation selection (Zellers et al., 2019); related benchmarks probe multi-step reasoning (Talmor et al., 2019). Probing work suggests different layers encode different linguistic information (Tenney et al., 2019; Rogers et al., 2020), but these lines of work do not systematically compare memory vs. reasoning across architectures under controlled stochastic inference. Our dual-task setup makes this distinction explicit and reveals consistent specialization patterns across model families and dropout settings.
2.6 Positioning Our Contribution

Prior work has studied dropout as regularization (Srivastava et al., 2014), MC Dropout for uncertainty estimation (Gal and Ghahramani, 2016), mechanistic transformer analyses (Elhage et al., 2021), and robustness/calibration in NLP (Hendrycks et al., 2020; Desai and Durrett, 2020). However, no study systematically evaluates inference-time dropout robustness across diverse transformer architectures while (i) independently varying dropout in attention vs. feed-forward layers and (ii) separating evaluation into factual recall versus commonsense inference using standard benchmarks (SQuAD and HellaSwag) (Rajpurkar et al., 2016; Zellers et al., 2019).

3 Methodology

We fine-tune and evaluate 19 transformer models on a balanced two-domain dataset: 500 memory items (factual recall from SQuAD) and 500 reasoning items (commonsense from HellaSwag). After training, each model is evaluated under five dropout configurations using Monte Carlo (MC) Dropout at test time, with 100 stochastic forward passes per test sample. This setup supports three controlled comparisons: (i) robustness across architectures and scales, (ii) dropout location effects (attention vs. feed-forward), and (iii) cognitive specialization (memory vs. reasoning). The following subsections describe dataset construction (§3.1), model selection (§3.2), dropout configurations (§3.3), training procedures (§3.4), evaluation protocols (§3.5), and implementation details (§3.6).

Table 1: Dataset Sources and Sampling Strategy

Dataset    | Cognitive Domain        | Source             | Sample Size (Total: Train/Test) | Max Input Length
SQuAD v1.1 | Memory (Factual Recall) | Wikipedia Passages | 500 (400/100)                   | 200 chars
HellaSwag  | Reasoning (Commonsense) | Event Scenarios    | 500 (400/100)                   | Full context

3.1 Dataset Construction

Data Sources. We use two established datasets with distinct cognitive demands: SQuAD v1.1 (Rajpurkar et al., 2016) for memory and HellaSwag (Zellers et al., 2019) for reasoning.
SQuAD emphasizes factual extraction from Wikipedia passages, while HellaSwag requires selecting plausible event continuations in everyday scenarios. This pairing enables a clean separation between factual recall and commonsense inference. Table 1 summarizes dataset characteristics and sampling.

Memory Task Construction. SQuAD is originally extractive QA; we convert it to binary classification. For each question–context pair, we construct one positive example (question + correct answer) and one negative example (question + incorrect answer sampled from a different question’s answer set). We ensure negatives differ from the correct answer. For input-length consistency, we filter questions exceeding 200 characters and truncate contexts to 200 characters. This yields 500 balanced memory samples (250 True, 250 False).

Reasoning Task Construction. HellaSwag provides a context with four candidate continuations (one correct). We form positives by pairing each context with its gold continuation and negatives by pairing the same context with one randomly selected incorrect continuation. Each context–continuation pair is formatted as a binary classification input. This yields 500 balanced reasoning samples (250 True, 250 False).

Dataset Split. We use an 80/20 stratified train–test split, stratified by task type, with random seed 42. This produces 800 training samples (400 memory, 400 reasoning) and 200 test samples (100 memory, 100 reasoning), enabling direct comparison of domain-specific metrics (Section 3.5).

3.2 Model Selection

We evaluate 19 publicly available transformer checkpoints spanning encoder-only and decoder-only families and a broad range of parameter scales (approximately 4M to 355M). The set includes common baselines (BERT, RoBERTa, DeBERTa), size variants (e.g., tiny to large), domain/task-specialized models (e.g., RoBERTa-SQuAD2, SciBERT), and decoder-only models (GPT-2 variants, GPT-Neo). Table 2 lists all architectures.

Table 2: Evaluated Transformer Models

Architecture Family | Type    | HuggingFace Checkpoint
BERT                | Encoder | bert-base-uncased
                    | Encoder | bert-large-uncased
                    | Encoder | prajjwal1/bert-tiny
RoBERTa             | Encoder | roberta-base
                    | Encoder | distilroberta-base
                    | Encoder | deepset/roberta-base-squad2
DeBERTa             | Encoder | microsoft/deberta-v3-base
                    | Encoder | microsoft/deberta-v3-small
Other Encoders      | Encoder | albert-base-v2
                    | Encoder | google/electra-base-discriminator
                    | Encoder | distilbert-base-uncased
                    | Encoder | SpanBERT/spanbert-base-cased
                    | Encoder | allenai/scibert_scivocab_uncased
                    | Encoder | sentence-transformers/all-MiniLM-L6-v2
GPT-2               | Decoder | gpt2 (Small)
                    | Decoder | openai-community/gpt2-medium
                    | Decoder | distilgpt2
                    | Decoder | sshleifer/tiny-gpt2
GPT-Neo             | Decoder | EleutherAI/gpt-neo-125m

3.3 Dropout Configurations

Each model is evaluated under five dropout configurations designed to isolate stochasticity in attention versus feed-forward layers. This is motivated by evidence that these subcomponents play different functional roles (Geva et al., 2021; Olsson et al., 2022). Table 3 defines the configurations. The baseline rate (0.1) follows common transformer defaults (Devlin et al., 2019). High dropout (0.6) is used as a stress test that substantially increases stochasticity while preserving non-trivial model behavior. By varying attention and FFN dropout independently, we attribute robustness differences to specific subcomponents rather than uniform dropout effects.

Implementation: We use HuggingFace Transformers (Wolf et al., 2020). Dropout rates are applied by modifying model configuration fields prior to loading checkpoints. Because parameter names differ by family (e.g., BERT attention_probs_dropout_prob vs. GPT-2 attn_pdrop), we implement a standardized mapping layer to target attention and FFN dropout consistently across all models.
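The standardized mapping layer described above can be sketched in a few lines. This is our illustration, not the authors’ code: the config field names are real HuggingFace attributes (attention_probs_dropout_prob / hidden_dropout_prob for BERT-style encoders, attn_pdrop / resid_pdrop for GPT-2), but the helper name and the exact field lists are assumptions about how such a layer might look.

```python
# Hypothetical mapping layer: set attention / feed-forward dropout uniformly
# across model families whose config objects use different attribute names.
# Field names follow HuggingFace Transformers configs; the helper is ours.

ATTN_FIELDS = ("attention_probs_dropout_prob", "attn_pdrop")
FFN_FIELDS = ("hidden_dropout_prob", "resid_pdrop")


def apply_dropout_config(config, attn_rate, ffn_rate):
    """Set attention/FFN dropout rates on a config object in place.

    Only attributes that actually exist on this family's config are touched,
    so the same call works for BERT-style and GPT-2-style configs.
    """
    for field in ATTN_FIELDS:
        if hasattr(config, field):
            setattr(config, field, attn_rate)
    for field in FFN_FIELDS:
        if hasattr(config, field):
            setattr(config, field, ffn_rate)
    return config
```

In practice the modified config would then be passed when loading the checkpoint (e.g., AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)), so the chosen rates are baked in before evaluation.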
Table 3: Dropout Configurations

Configuration  | Attention Rate | FFN Rate | Description
Deterministic  | 0.0            | 0.0      | Standard inference mode (dropout disabled)
Baseline       | 0.1            | 0.1      | Typical training-time dropout rates
High Attention | 0.6            | 0.1      | High stochasticity in attention mechanisms
High FFN       | 0.1            | 0.6      | High stochasticity in feed-forward networks
High Both      | 0.6            | 0.6      | Elevated stochasticity in all components

3.4 Training Procedure

All models are fine-tuned for binary classification using AdamW (Loshchilov and Hutter, 2019). We use learning rate 2×10⁻⁵ with 10% linear warmup and linear decay, batch size 16 for training and 32 for evaluation, gradient clipping at 1.0, and 5 epochs for all runs. The loss is binary cross-entropy. We keep the same schedule across models to avoid confounding comparisons with differing training budgets.

Optimization: We use mixed precision (FP16) via torch.cuda.amp to reduce memory use and accelerate training, enabling larger models to fit within a 12GB VRAM budget.

Hardware: Experiments run on a single NVIDIA GeForce RTX 3060 GPU (12GB VRAM) with an AMD Ryzen 5 5600X CPU. Across all 95 evaluations, total compute is approximately 28 GPU-hours.

Reproducibility: We fix random seed 42 for data splitting, initialization, and training-time stochasticity. MC Dropout evaluation uses independent dropout masks across forward passes.

3.5 Evaluation Protocol

Monte Carlo Dropout Procedure. Following Gal and Ghahramani (2016), we enable dropout at inference time by placing dropout modules in training mode while keeping the model otherwise in evaluation mode. For each test sample, we perform 100 stochastic forward passes with independently sampled dropout masks, producing a distribution of predictions used to quantify stability.

Accuracy Calculation. Rather than averaging probabilities across passes before scoring, we compute run-level accuracy to directly measure prediction variability.
For each of the 100 forward passes, we produce predictions for all 200 test samples and compute accuracy for that pass. We then report the mean and standard deviation across the 100 run-level accuracies. In this setup, the standard deviation is our primary robustness indicator: lower values indicate stable predictions under dropout perturbations.

Task-Specific Metrics. For each model–configuration pair, we compute overall, memory-only, and reasoning-only accuracies, each summarized by mean (μ) and standard deviation (σ) over M = 100 runs. Let α_mem and α_reas denote domain-specific mean accuracies. We define the Memory–Reasoning Differential as:

Δ_cog = α_mem − α_reas    (1)

Positive values indicate memory-biased performance, negative values indicate reasoning-biased performance, and values near zero indicate balanced performance.

Statistical Analysis. We report descriptive statistics across models and configurations, and perform hypothesis tests with Bonferroni correction. For within-model comparisons, we test deterministic inference against dropout-enabled settings across architectures (α = 0.05/15). For cross-model comparisons, we compare stability extremes (bottom vs. top quartile by standard deviation) using three tests with Bonferroni correction (α = 0.05/3). We report effect sizes (Cohen’s d) to contextualize differences.

3.6 Implementation Details

Our implementation uses PyTorch 2.0+ and HuggingFace Transformers 4.30.0 (Wolf et al., 2020) with Python 3.10. Data loading and preprocessing use HuggingFace Datasets (Lhoest et al., 2021). The pipeline is modular, separating dataset construction, training, MC Dropout evaluation, and aggregation.

Computational Optimizations: We use FP16 training, fused AdamW where available, cached tokenization to avoid repeated preprocessing, batched MC inference, and explicit GPU memory cleanup between runs. These measures reduce total runtime to approximately 28 GPU-hours across all 95 evaluations.
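The MC Dropout evaluation protocol of §3.5 can be sketched as follows. This is a minimal illustration under our own assumptions (helper names are ours; the model is any PyTorch module mapping inputs to logits), not the paper’s implementation: dropout modules are switched to training mode while the rest of the model stays in evaluation mode, and each stochastic forward pass yields one run-level accuracy.

```python
import torch


def enable_mc_dropout(model):
    """Put only the Dropout modules in train mode; everything else in eval.

    This re-enables stochastic masking at inference time (MC Dropout)
    without re-enabling other train-time behavior such as batch-norm updates.
    """
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()


@torch.no_grad()
def mc_dropout_accuracies(model, inputs, labels, n_passes=100):
    """Return one accuracy per stochastic forward pass (run-level accuracy).

    The mean of the returned list is the reported accuracy; its standard
    deviation is the stability indicator used in the paper's protocol.
    """
    enable_mc_dropout(model)
    accs = []
    for _ in range(n_passes):
        logits = model(inputs)                 # independent dropout mask each pass
        preds = logits.argmax(dim=-1)
        accs.append((preds == labels).float().mean().item())
    return accs
```

For a real checkpoint the inputs would be tokenized batches and the logits would come from a sequence-classification head, but the mode-switching and per-pass scoring logic is the same.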
4 Results

4.1 Overview of Performance

Across 19 models and five dropout configurations, we conducted 95 evaluations, each summarized over 100 Monte Carlo (MC) forward passes per test sample. Under the baseline configuration (0.1/0.1), microsoft/deberta-v3-small achieves the highest overall accuracy (0.796; Table 4). Its performance reflects strong memory accuracy (0.922) and comparatively higher reasoning accuracy (0.669) than other top-performing models, indicating that strong overall performance requires both high recall and non-trivial commonsense inference capability.

While several models achieve very high memory accuracy (e.g., deberta-v3-base at 0.938), reasoning accuracy remains substantially lower (0.462), yielding a large domain gap. This pattern recurs across architectures and motivates the task-specific analyses below.

Table 4: Top 5 Performing Models (Baseline Configuration)

Model            | Overall Acc. | Memory Acc. | Reasoning Acc. | Overall Std | Memory Std | Reasoning Std
deberta-v3-small | 0.796        | 0.922       | 0.669          | 0.0161      | 0.0139     | 0.0527
gpt2-medium      | 0.722        | 0.922       | 0.523          | 0.0146      | 0.0139     | 0.0464
scibert          | 0.703        | 0.870       | 0.535          | 0.0138      | 0.0172     | 0.0446
spanbert         | 0.701        | 0.915       | 0.486          | 0.0143      | 0.0139     | 0.0464
deberta-v3-base  | 0.700        | 0.938       | 0.462          | 0.0156      | 0.0139     | 0.0526

Note: Accuracies represent mean values across 100 Monte Carlo forward passes. Standard deviations are reported separately for overall, memory, and reasoning metrics to assess prediction stability across task types (lower values indicate more stable predictions).

Table 5: Models Where Deterministic Inference Outperforms MC Dropout

Model                      | Deterministic Mean | Det. Std | Baseline (MC) Mean | MC Std | Degradation
roberta-base-squad2        | 0.735              | 0.0000   | 0.497              | 0.0314 | −0.238
albert-base-v2             | 0.640              | 0.0129   | 0.489              | 0.0339 | −0.151
deberta-v3-base            | 0.800              | 0.0000   | 0.700              | 0.0156 | −0.100
electra-base-discriminator | 0.735              | 0.0000   | 0.673              | 0.0191 | −0.062
spanbert-base-cased        | 0.745              | 0.0000   | 0.701              | 0.0143 | −0.044

4.2 Deterministic vs. Stochastic Inference: When MC Dropout Hurts

We first compare deterministic inference (dropout disabled at test time) to baseline MC Dropout (0.1/0.1). In 10 of 19 models (53%), deterministic inference yields higher accuracy than MC Dropout under the baseline setting. The largest degradations occur in models with strong task specialization, as summarized in Table 5. For roberta-base-squad2, accuracy drops from 0.735 (deterministic) to 0.497 (baseline MC), a reduction of 0.238. In addition, enabling dropout introduces non-trivial run-to-run variability (standard deviation increasing from 0.0000 to 0.0314). Figure 1 visualizes these effects for the five most affected models.

Fig. 1: Overall Accuracy Comparison (Top 5 Degradation). Performance comparison between deterministic inference (blue) and baseline MC Dropout (orange, 0.1 rate) across the five models exhibiting the largest overall degradation. Error bars represent the standard deviation across 100 stochastic forward passes.

Crucially, the degradation is not uniform across task types. As shown in Figure 2, reasoning performance remains relatively stable between deterministic and stochastic modes for these models, with overlapping error bars. In contrast, memory performance exhibits large degradations under MC Dropout (Figure 3). For the most task-specialized models, the overall accuracy loss is largely attributable to memory brittleness under inference-time stochasticity.

These findings suggest that inference-time dropout interacts strongly with specialization: models optimized for narrow task structure can become sensitive to stochastic perturbations, whereas broadly pre-trained general-purpose models often exhibit smaller changes under baseline MC Dropout. This motivates evaluating robustness as a joint function of architecture, configuration, and task type rather than assuming stochastic inference improves reliability by default.

4.3 Asymmetric Dropout Effects: Memory vs. Reasoning

To quantify task-dependent sensitivity, Table 6 reports mean accuracy aggregated across all 19 models for each dropout configuration, separated into memory and reasoning components. High dropout substantially reduces memory accuracy: from 0.792 (baseline) to 0.533 (high both), a decrease of 0.259 (25.9 percentage points). In contrast, reasoning accuracy changes minimally across configurations (0.492 at baseline vs. 0.497 at high both), indicating that reasoning is comparatively insensitive to inference-time dropout in this setup. The same pattern appears when increasing dropout in attention or feed-forward layers individually: memory decreases sharply, while reasoning remains near-constant.

Table 6: Dropout Configuration Effects on Cognitive Capabilities

Configuration            | Memory Acc. | Reasoning Acc. | Gap
Deterministic (0.0/0.0)  | 0.804       | 0.508          | +0.296
Baseline (0.1/0.1)       | 0.792       | 0.492          | +0.301
High Attention (0.6/0.1) | 0.588       | 0.479          | +0.110
High FFN (0.1/0.6)       | 0.538       | 0.494          | +0.044
High Both (0.6/0.6)      | 0.533       | 0.497          | +0.036

An important consequence is a compression of the memory–reasoning gap. Under deterministic and baseline settings, the gap is approximately 0.30, whereas under the strongest stochasticity (0.6/0.6) it shrinks to 0.036. This does not imply that dropout equalizes capabilities; rather, memory performance degrades disproportionately under stochastic perturbations, reducing apparent specialization.

Fig. 2: Reasoning Task Performance. Reasoning accuracy remains relatively stable between deterministic and stochastic modes, with overlapping error bars indicating minimal impact of dropout on inferential capabilities.

4.4 Universal Memory Bias Across Architectures

Under the baseline configuration (0.1/0.1), 16 of 19 models (84%) show positive memory–reasoning differentials, with a mean gap of +30.1 percentage points. This pattern holds for both encoder-only and decoder-only families: encoder-only models achieve 0.782 memory vs. 0.496 reasoning (gap +0.286), while decoder-only models achieve 0.822 memory vs. 0.479 reasoning (gap +0.344). The consistency across families suggests that the observed bias is not primarily explained by attention directionality (bidirectional vs. unidirectional).

Two factors may contribute. First, memory items often admit a more determinate solution structure than commonsense continuation selection, which can contain multiple plausible alternatives. Second, pre-training objectives emphasize learning statistical regularities and factual co-occurrence patterns, which may disproportionately support recall-like behavior relative to commonsense inference. Regardless of the underlying cause, the effect is consistent across architectures in our evaluation.

Fig. 3: Memory Task Performance. Memory accuracy exhibits severe degradation under MC Dropout, particularly for task-specialized models like roberta-base-squad2.

4.5 Robustness and Stability Analysis

Prediction stability, measured as the standard deviation of run-level accuracies over 100 stochastic passes, varies substantially across models. Some models exhibit near-zero variability under baseline MC Dropout (e.g., GPT-Neo-125M with std reported as 0.000), indicating that sampled dropout masks rarely change discrete predictions on the test set. Others show noticeably higher volatility (e.g., RoBERTa-base at std=0.034), indicating that stochastic perturbations can shift decision outcomes across runs.

Stability is not strongly coupled to accuracy. For example, microsoft/deberta-v3-small attains the highest baseline accuracy (0.796) with moderate variability (std=0.0161), whereas a highly stable model can still achieve lower accuracy. This distinction is practically important: a model may be consistently incorrect or inconsistently correct, so uncertainty-aware deployment must consider both mean performance and stability.
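The stability and bias metrics used throughout these results reduce to simple summaries of the per-run accuracies. A minimal sketch (helper names are ours, not the paper’s) of the mean/std summary and the Memory–Reasoning Differential Δ_cog = α_mem − α_reas from §3.5:

```python
from statistics import mean, stdev


def summarize_runs(run_accuracies):
    """Mean and standard deviation over M run-level accuracies.

    The standard deviation is the stability indicator: lower values mean
    predictions change little across dropout masks.
    """
    return mean(run_accuracies), stdev(run_accuracies)


def memory_reasoning_differential(mem_runs, reas_runs):
    """Delta_cog = alpha_mem - alpha_reas (Eq. 1).

    Positive => memory-biased, negative => reasoning-biased,
    near zero => balanced.
    """
    return mean(mem_runs) - mean(reas_runs)
```

Fed with the 100 memory-only and reasoning-only run accuracies of a model, the differential directly reproduces the per-model gaps reported in this section.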
Finally, task-specialized models (e.g., roberta-base-squad2) tend to show larger degradation and, in many cases, higher variability under stochastic inference than general-purpose checkpoints. This is consistent with the broader observation that specialization can increase sensitivity to inference-time perturbations, motivating robustness evaluation as part of model selection rather than assuming MC Dropout is universally beneficial.

5 Discussion

5.1 Key Findings and Implications

Let’s start with the big picture. We evaluated 19 transformer models across 95 model–configuration evaluations (five dropout configurations each), and what emerged challenges some fairly fundamental assumptions about how these models behave under uncertainty. Three findings stand out, but they’re not equally straightforward—some came with surprises, others raised more questions than they answered.

Finding 1: MC Dropout isn’t the robustness silver bullet we thought. Deterministic inference wins in 10 of 19 models (53%). This doesn’t mean MC Dropout is broken, but it’s definitely not universally beneficial. Some models crash hard—roberta-base-squad2 loses 24 percentage points when you enable dropout at inference. We suspect this happens because task-specialized models develop brittle, finely tuned representations during fine-tuning. Shake those representations with stochastic masking, and performance collapses. The pattern is clear: MC Dropout benefits correlate inversely with task specialization. General-purpose models (like base BERT) handle stochastic inference reasonably well; highly specialized models do not. For practitioners, here’s the catch—you can’t assume stochastic inference will improve your deployment. Test it empirically on your specific task before committing.

Finding 2: Dropout hits memory and reasoning asymmetrically. This one’s interesting. High dropout configurations (0.6 rates) reduce memory accuracy by 27 percentage points on average, while reasoning accuracy drops only 1 point.
That's a massive difference. One possibility is that memory tasks require stable, precise neural activations to retrieve specific facts. If you randomly mask 60% of neurons, you lose access to the exact circuit that stores "Paris is the capital of France." Reasoning tasks, by contrast, appear to use distributed, redundant representations: even with heavy dropout, enough pathways survive to perform commonsense inference like "ice melts when heated." This mechanistic insight connects architectural stochasticity to cognitive function in a principled way. Be careful, though: this doesn't mean reasoning is "easy" for transformers. It just means reasoning circuits are more robust to random perturbation than memory circuits.

Finding 3: Transformers favor memory over reasoning, nearly across the board. Across 19 models spanning encoder and decoder architectures, 84% show memory-biased performance with a mean gap of +30 percentage points. This near-universal pattern surprised us. We initially hypothesized that attention mechanisms (bidirectional in encoders vs. unidirectional in decoders) would create different specialization patterns. They don't. BERT and GPT-2 both prefer memory tasks; DeBERTa and DistilGPT-2 tell the same story. The consistency across architectural families suggests this phenomenon stems from task characteristics and pre-training objectives rather than attention design. Pre-training on next-token prediction and masked language modeling both emphasize factual recall over multi-step inference. Worth noting: this doesn't necessarily reflect fundamental model limitations; it might just reflect what we train these models to do. Future pre-training objectives that emphasize reasoning (like chain-of-thought training) might shift this balance.

5.2 Practical Recommendations

Based on our findings, here is actionable guidance for model selection and deployment. These recommendations vary in scope: some are universal cautions, others apply only to specific scenarios.
1. Don't assume MC Dropout improves your use case. This is the big one. Benchmark deterministic vs. stochastic inference on your target task before deployment. Run both modes, measure performance, and make an empirical decision. The theoretical benefits of uncertainty quantification don't guarantee practical gains. We found that 53% of models perform better without MC Dropout; your specific model and task might fall into that majority.

2. Avoid MC Dropout for task-specialized models. If you're deploying a model fine-tuned on a specific downstream task (like a SQuAD-trained QA system or a domain-specific classifier), use deterministic inference. Period. Specialized models suffer severe degradation under stochastic inference, up to 24 percentage points in our experiments. The brittleness comes from fine-tuning creating highly optimized, non-redundant circuits; dropout breaks them.

3. Prioritize DeBERTa for balanced performance. The DeBERTa family consistently performs well across both memory and reasoning tasks, and it handles moderate dropout relatively gracefully. If you need a general-purpose encoder that won't collapse under inference-time stochasticity, DeBERTa-v3-small is a solid choice (79.6% overall accuracy in our baseline configuration). It's not the absolute best at any single thing, but it's reliable across conditions.

4. Match dropout rates to your task type. Here's where things get tactical. If your application primarily involves factual recall (knowledge base QA, entity recognition, fact verification), minimize inference-time dropout; memory tasks degrade sharply with stochasticity. But if you're building a reasoning-heavy application (causal inference, logical consistency checking, multi-step problem solving), moderate dropout (0.1–0.2) might actually improve calibration without substantial accuracy loss. The asymmetry works in your favor for reasoning tasks.
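Recommendation 1 can be operationalized with a small harness. The PyTorch sketch below (hypothetical helper names, not the authors' code) shows one common way to obtain MC Dropout predictions at inference time: keep the model in eval mode overall, but switch its `nn.Dropout` submodules back to training mode so they keep sampling masks.

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> None:
    """Eval mode overall, but keep Dropout layers sampling stochastic masks."""
    model.eval()
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor,
                       passes: int = 100) -> torch.Tensor:
    """Mean class probabilities over `passes` stochastic forward passes."""
    enable_mc_dropout(model)
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(passes)])
    return probs.mean(dim=0)

@torch.no_grad()
def deterministic_predict(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Single pass with all dropout disabled: the baseline to compare against."""
    model.eval()
    return model(x).softmax(dim=-1)
```

Comparing the argmax accuracy of both functions on a held-out set is the empirical check the recommendation calls for; if the deterministic path wins, as it did for 53% of the models studied here, stochastic inference should not be deployed for that model.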
5.3 Limitations and Future Work

Our study has several limitations worth acknowledging upfront. First, we use 1,000 samples (500 memory, 500 reasoning) from SQuAD and HellaSwag. That's sufficient for cross-model comparison, but validation on larger test sets and additional benchmarks would strengthen generalization claims. We're especially curious how these findings would hold on emerging reasoning benchmarks like BIG-Bench Hard or mathematical reasoning datasets. Second, we focus exclusively on binary classification. Extending this framework to generation tasks, multi-class classification, and structured prediction would reveal whether asymmetric dropout effects persist across task formats. We suspect generation might show different patterns, since it requires sustained, multi-step activation rather than single forward passes.

Third, computational constraints limited us to models with ≤345M parameters. Scaling laws predict that larger models should be more robust, but we couldn't test this under stochastic inference conditions. Do 1B+ parameter models (LLaMA, GPT-3 scale) show reduced specialization, or do they simply amplify the memory bias we observed? That's an open question. One possibility is that larger models develop more redundant circuits, making them naturally resistant to dropout. Another is that scale doesn't fundamentally change cognitive specialization and merely improves overall accuracy on both task types proportionally.

Future work should investigate optimal dropout schedules during fine-tuning that preserve general robustness while achieving task-specific performance. We focused on inference-time dropout, but training-time schedules might interact with these patterns in non-obvious ways.
Additionally, exploring alternative stochastic inference methods (Bayesian neural networks, deep ensembles, variational dropout) could identify approaches that avoid MC Dropout's pitfalls while maintaining uncertainty quantification benefits. We suspect deep ensembles might handle task-specialized models better, since they don't rely on architectural stochasticity. Finally, mechanistic interpretability techniques could elucidate precisely which neurons and circuits get disrupted by dropout in memory vs. reasoning tasks. Our current analysis is behavioral: we observe that dropout affects these capabilities asymmetrically, but we don't know the circuit-level mechanism. Do memory tasks rely on sparse, localized circuits while reasoning uses distributed ones, or is the difference more subtle? Causal intervention experiments (like activation patching with dropout) might reveal the underlying computational structure and transform our behavioral observation into mechanistic understanding.

6 Conclusion

Neural networks are fragile in ways we didn't expect. We tested 19 transformer models across 95 dropout configurations and found something surprising: the standard practice of using Monte Carlo Dropout for uncertainty estimation often hurts more than it helps. In 53% of models, turning dropout off at test time gives better results than keeping it on. Task-specialized models crash spectacularly, losing up to 24 percentage points.

But the story gets more interesting when you separate memory from reasoning. Dropout doesn't affect cognitive capabilities equally. It hammers memory tasks (factual recall) while barely touching reasoning tasks (logical inference). This asymmetry holds across every architecture we tested: BERT, RoBERTa, DeBERTa, GPT-2, all of them. Memory drops by 8.7 percentage points on average under high dropout; reasoning drops just 2.3 points, a 3.8× difference. Here's what this means for practice.
Don't blindly apply MC Dropout to every model. For task-specialized systems (SQuAD-trained QA, domain-adapted medical models), stick with deterministic inference. For general-purpose models that need uncertainty estimates, MC Dropout works, but expect it to degrade memory-heavy tasks more than reasoning-heavy ones. Test both modes: the "safer" choice isn't always better.

Our cognitive decomposition framework, which splits performance into memory vs. reasoning components, reveals patterns that aggregate metrics miss entirely. When you look only at overall accuracy, you think a model is just "getting worse" under dropout. When you decompose it, you see the model losing its memory while preserving its logic: a different failure mode entirely. This has implications for model selection: if your application is memory-intensive (information retrieval, fact verification), avoid stochastic inference. If it's reasoning-intensive (mathematical problem solving, causal inference), MC Dropout might not hurt as much.

We believe this work opens three research directions. First, architecture-aware uncertainty methods that adapt to model-specific characteristics rather than applying one-size-fits-all approaches. Second, cognitive-aware regularization that preserves memory capabilities while maintaining reasoning robustness. Third, better evaluation protocols that measure memory and reasoning separately instead of lumping everything into aggregate scores.

The transformer era has taught us that bigger models with more data solve more problems. Our work suggests a different lesson: the same model can show dramatically different cognitive profiles depending on how you run it. Inference matters as much as architecture. Maybe we've been focused on the wrong thing.

Acknowledgments

Both authors contributed equally to this research. This research was supported by Southern Methodist University. We thank the reviewers for their valuable feedback that helped improve this work.
Bibliography

Shrey Desai and Greg Durrett. Calibration of pre-trained transformers. arXiv preprint arXiv:2003.07892, 2020.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2021.
Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. DropBlock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, volume 31, 2018.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330, 2017.
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2021.
Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100, 2020.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. Proceedings of the AAAI Conference on Artificial Intelligence, 34:8018–8025, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2017.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick Von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846, 2021.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 2020.
Artem Shelmanov, Evgenii Tsymbalov, Dmitri Puzyrev, Kirill Fedyanin, Alexander Panchenko, and Maxim Panov. How certain is your transformer? arXiv preprint arXiv:2104.04241, 2021.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2019.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 1058–1066, 2013.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 2019.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.