Paper deep dive
The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models
Seonglae Cho, Zekun Wu, Kleyton Da Costa, Adriano Koshiyama
Models: GPT-2 (124M), GPT-2 Large (774M), GPT-2 Medium (355M), Gemma-2-2B, Llama-3.2-1B, Llama-3.2-3B, Mistral-7B, Qwen2-1.5B, Qwen2-7B
Abstract
When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.
Tags
Links
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/11/2026, 12:41:31 AM
Summary
The paper characterizes the geometric structure of correctness representations in language models, finding that a discriminative confidence signal exists in a low-dimensional (3-8D) subspace of the residual stream. This signal is linearly separable and consistent across 9 models from 5 architecture families. The authors demonstrate that centroid distance in this subspace matches trained probe performance, enabling efficient few-shot detection, and validate the causal nature of this signal through activation steering.
Entities (5)
Relation Signals (4)
Centroid distance → matches → trained probe performance
confidence 95% · Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC)
Confidence Manifold → occupies → 3-8 dimensions
confidence 95% · the discriminative signal occupies 3–8 dimensions
Activation Steering → causally validates → Confidence Manifold
confidence 90% · We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates
Language Models → encode → Confidence Manifold
confidence 90% · We characterize the geometry of correctness representations across 9 models
Cypher Suggestions (2)
Map the relationship between methods and their performance metrics · confidence 90% · unvalidated
MATCH (m:Method)-[r:ACHIEVES]->(metric:Metric) RETURN m.name, metric.value, r.dataset
Find models that exhibit a specific intrinsic dimension for correctness · confidence 85% · unvalidated
MATCH (m:Model)-[:HAS_PROPERTY]->(p:Property {name: 'intrinsic_dimension'}) WHERE p.value >= 3 AND p.value <= 8 RETURN m.name
Full Text
83,116 characters extracted from source content.
The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Seonglae Cho, Zekun Wu, Kleyton Da Costa, Adriano Koshiyama (Holistic AI; University College London). Correspondence to: Seonglae Cho <seonglae.cho@holisticai.com>. https://github.com/seonglae/confidence-manifold
arXiv:2602.08159v1 [cs.LG] 8 Feb 2026

Abstract

When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.

1. Introduction

Large language models produce confident-sounding outputs regardless of factual accuracy (Rawte et al., 2023; Ji et al., 2025). A model may assert falsehoods with the same linguistic certainty as truths, undermining deployment in high-stakes domains. Prior work established that LLMs encode truth-related signals in their activations (Azaria & Mitchell, 2023; Burns et al., 2023; Marks & Tegmark, 2024), but treated this signal as a single direction to find and exploit. The underlying geometric structure (how many dimensions encode this signal, whether it admits simple decision boundaries, and what minimal representations suffice) remained uncharacterized.

We characterize the geometry of correctness representations in transformer activations (Figure 1). Analyzing 9 models from 5 architecture families, we find the discriminative signal is simple: it occupies 3-8 dimensions, performance decreases with additional dimensions, and no nonlinear classifier (convex hull, Mahalanobis, kernel SVM) improves over linear probes. The optimal decision boundary is a hyperplane; complex boundaries model structure that does not exist. Prior work established that linear truth directions exist (Marks & Tegmark, 2024); we characterize what they did not: (i) discriminative rank is 3-8D, (ii) adding dimensions hurts, (iii) centroid distance matches trained probes, and (iv) this structure is consistent across 9 models from 5 architecture families.

This simplicity enables a practical method. Class separation is driven by a mean shift between correct and incorrect distributions: centroid distance matches probe performance (0.90 AUC), requiring only two mean vectors rather than discriminative training. On GPT-2, centroid-based detection with 25 labeled examples achieves 89% of full-data performance.

We validate causally via activation steering. The learned direction produces monotonic, 10.9 percentage point changes in error rates; random and orthogonal directions show no effect.
Internal probes achieve 0.80-0.97 AUC (GroupKFold CV, 3 seeds) while output-based methods achieve only 0.44-0.64 AUC; the correctness signal exists internally but is not expressed in outputs (Orgad et al., 2025). Uncertainty ≠ correctness: semantic entropy achieves only 0.55 AUC because models confidently assert misconceptions.

2. Background

Linear Representation Hypothesis. Neural networks encode semantic concepts as linear directions in activation space (Mikolov et al., 2013; Park et al., 2024). The classic arithmetic $\vec{v}_{\mathrm{king}} - \vec{v}_{\mathrm{man}} + \vec{v}_{\mathrm{woman}} \approx \vec{v}_{\mathrm{queen}}$ extends to abstract features in transformers (Nanda et al., 2023; Marks & Tegmark, 2024).

[Figure 1. Layer-wise evolution across 9 models (GPT-2, GPT-2-Med, GPT-2-Large, Llama-1B, Llama-3B, Gemma-2B, Qwen2-1.5B, Qwen2-7B, Mistral-7B). (a) Detection performance (test AUC) by layer depth: GPT-2 family peaks at final layers (100%), instruction-tuned models at mid-layers (43-75%). (b) Intrinsic dimension by layer: decreases through layers, converging to 8-12D at optimal layers.]

For concept c, direction $w_c$ enables both extraction ($h^\top w_c$ correlates with c) and intervention (adding $\alpha w_c$ steers behavior). This extends to truth (Burns et al., 2023; Marks & Tegmark, 2024) and internal states (Azaria & Mitchell, 2023; Su et al., 2024; Sriramanan et al., 2024). Recent work shows LLMs encode more truthfulness information than they express in outputs (Orgad et al., 2025), and confidence regulation involves specific neural mechanisms (Stolfo et al., 2024). While prior work establishes the existence of such signals, none characterize the geometric structure: how many dimensions encode correctness, whether nonlinear boundaries help, and why centroid-based methods match trained classifiers. We provide this characterization across 9 models.

Uncertainty vs. Correctness. Token entropy $H(p) = -\sum_i p_i \log p_i$ conflates linguistic and epistemic uncertainty. Semantic entropy (Kuhn et al., 2023; Farquhar et al., 2024) addresses this by clustering N generations by meaning via NLI, computing entropy over clusters. While effective for uncertainty estimation, it requires N forward passes plus NLI inference. Critically, calibration evolves across layers with a low-dimensional direction in the residual stream (Joshi et al., 2025), but distributional certainty alone is insufficient: models can be confidently wrong on TruthfulQA's misconception-laden questions. Semantic entropy probes (Han et al., 2024) predict SE efficiently but inherit its limitation of measuring uncertainty rather than correctness.

Activation Steering. Inference-time intervention modifies activations: $h' = h + \alpha w$ (Li et al., 2023b; Turner et al., 2024). Contrastive activation addition (Rimsky et al., 2024) and representation engineering (Zou et al., 2023) demonstrate behavioral control for honesty and safety. Adaptive steering (Wang et al., 2025) adjusts intervention strength per-sample based on predicted uncertainty. We use steering for causal validation: verifying learned directions affect outputs, not merely correlate.

Sparse Autoencoders. SAEs decompose activations into sparse, interpretable features (Bricken et al., 2023; Cunningham et al., 2023): $f = \mathrm{ReLU}(W_{\mathrm{enc}} h + b)$, $\hat{h} = W_{\mathrm{dec}} f$. Scaling to production models (Templeton et al., 2024; Lieberum et al., 2024) enables feature-level analysis. However, recent work shows dense SAE latents (not sparse) capture entropy regulation (Sun et al., 2025), and SAEs are suboptimal for certain steering tasks (Arad et al., 2025). Task-specific SAE training (Kissane et al., 2024) or alternative decomposition methods (Marks et al., 2025; Engels et al., 2025) remain promising directions. We compare SAE-based detection against raw probes.

Problem Setting. Given question q and response sentences $s_1, \dots, s_n$, predict factual correctness $y_i \in \{0, 1\}$ for each $s_i$ from hidden state $h_i^{(\ell)}$, enabling fine-grained factuality assessment (Min et al., 2023). Goals: (1) efficient single-pass inference, (2) accurate improvement over entropy baselines, (3) causal validation via intervention.

3. Method

Our framework extracts, characterizes, and validates the confidence manifold through four components: contrastive data construction, direction learning, geometric analysis, and causal validation. We use "confidence" to denote the model's internal representation of correctness, not output-level uncertainty; a model may produce low-entropy outputs while encoding that the response is likely incorrect.

Formulation. We hypothesize that transformers encode correctness in a low-dimensional subspace $\mathcal{M} \subseteq \mathbb{R}^d$ of the residual stream satisfying: (1) projection onto $\mathcal{M}$ predicts correctness, (2) intervention along $\mathcal{M}$ causally affects outputs, (3) $\mathcal{M}$ has consistent structure across layers/models.

3.1. Data Construction and Direction Learning

Contrastive pairs. TruthfulQA provides paired correct/incorrect answers per question (Lin et al., 2022), yielding $\{(h_i^+, h_i^-)\}_{i=1}^N$ controlling for topic and difficulty. Each pair shares the same question stem, isolating the correctness signal from content variation. We extract hidden states from the last token position (Gurnee et al., 2023; Belinkov, 2022) across all L layers.

Probe training. Logistic regression with $L_2$ regularization (C = 0.1): $p(y = 1 \mid h) = \sigma(w^\top h + b)$, trained with cross-entropy loss plus $L_2$ penalty. The learned $w^{(\ell)}$ is the confidence direction at layer $\ell$. We supervise on correctness labels directly, not entropy proxies, a choice validated by semantic entropy's failure to predict correctness (§5).
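A minimal sketch of this probe-training step, assuming activations have already been extracted; the synthetic X and y below are stand-ins for real last-token hidden states and correctness labels, and only the C = 0.1 logistic regression comes from the text:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
d = 768                                  # hidden size at GPT-2 scale (illustrative)
X = rng.normal(size=(400, d))            # stand-in for last-token activations at layer l
y = rng.integers(0, 2, size=400)         # stand-in for correctness labels (1 = correct)

scaler = StandardScaler().fit(X)
probe = LogisticRegression(C=0.1, penalty="l2", max_iter=1000)
probe.fit(scaler.transform(X), y)

w = probe.coef_[0]                       # learned confidence direction w^(l)
w_hat = w / np.linalg.norm(w)            # unit direction reused for geometry and steering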
3.2. Geometric Analysis

We distinguish two notions of dimensionality:

Intrinsic dimension (representation geometry). The Levina-Bickel MLE estimator (Levina & Bickel, 2004) measures the dimensionality of the data manifold itself: for each point $x_i$ with k nearest neighbors at distances $T_1 < \cdots < T_k$,

$$\hat{d}_k(x_i) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k}{T_j} \right]^{-1} \qquad (1)$$

averaged over samples and $k \in [5, 20]$. We pool correct and incorrect activations to estimate the ID of the overall representation manifold at each layer. This reveals the embedding dimension of hidden states (8-12D at optimal layers), independent of any classification task.

Discriminative dimension (classification geometry). PLS regression projects activations onto directions maximizing covariance with labels. Sweeping PLS components (1-32D) reveals how many dimensions are useful for classification. We find a 3-8D peak: additional dimensions add noise that hurts generalization, even though the representation manifold spans more dimensions. The gap (3-8D discriminative vs 8-12D intrinsic) indicates most manifold structure is orthogonal to the correctness signal.
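A sketch of the Levina-Bickel estimator in eq. (1), averaged over points and over k in [5, 20] as described; the function name and toy input are illustrative:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dimension(X, k_min=5, k_max=20):
    # distances to the k_max nearest neighbors; column 0 is the point itself
    dists, _ = NearestNeighbors(n_neighbors=k_max + 1).fit(X).kneighbors(X)
    dists = dists[:, 1:]
    estimates = []
    for k in range(k_min, k_max + 1):
        T_k = dists[:, k - 1:k]                        # distance to the k-th neighbor
        log_ratio = np.log(T_k / dists[:, :k - 1])     # log(T_k / T_j), j = 1..k-1
        estimates.append(1.0 / log_ratio.mean(axis=1)) # eq. (1), per point
    return float(np.mean(estimates))

X = np.random.default_rng(0).normal(size=(500, 10))    # toy cloud with true ID of 10
print(intrinsic_dimension(X))                          # close to 10 for this Gaussian cloud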
Grassmannian distance. To quantify cross-layer alignment, we compute the chordal distance $d_G = \sin\theta$ where $\theta = \arccos(|\hat{w}_1^\top \hat{w}_2|)$ and $\hat{w} = w / \|w\|$. Low distance indicates consistent encoding across layers.

Layer similarity matrix. $S_{ij} = |\hat{w}^{(i)\top} \hat{w}^{(j)}|$ reveals manifold structure: block patterns indicate coherent encoding regions.

Procrustes alignment. For cross-model comparison with different hidden dimensions $d_1 \neq d_2$, orthogonal Procrustes finds the optimal rotation R minimizing $\|W_1 R - W_2\|_F^2$ subject to $R^\top R = I$.

3.3. Causal Validation via Activation Steering

To verify the confidence direction is causal, we perform inference-time intervention (Li et al., 2023b; Turner et al., 2024):

$$h'^{(\ell^*)} = h^{(\ell^*)} + \alpha \cdot \hat{w}^{(\ell^*)} \qquad (2)$$

sweeping $\alpha \in [-5, +5]$ and measuring error rate change.

Controls. Random direction $r \sim \mathcal{N}(0, I)$ and orthogonal direction $r_\perp = r - (r^\top \hat{w})\hat{w}$, both normalized. Causality requires: (1) the learned direction produces monotonic effects, (2) positive α increases correctness, (3) controls show no systematic effect.
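A sketch of the intervention in eq. (2) as a forward hook on GPT-2; the layer index, α, and the random stand-in for the learned unit direction are illustrative, and the hook signature assumes the Hugging Face transformers block API:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

layer_idx, alpha = 11, 3.0                      # optimal layer l* and steering coefficient
w_hat = torch.randn(model.config.hidden_size)   # stand-in for the learned probe direction
w_hat /= w_hat.norm()

def add_direction(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual-stream hidden state;
    # add alpha * w_hat at every token position, per the protocol in §5.7
    return (output[0] + alpha * w_hat.to(output[0].dtype),) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_direction)
ids = tok("The capital of Australia is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=8)[0]))
handle.remove()

# Controls: a normalized random direction r, and its component orthogonal to w_hat,
# r_perp = r - (r @ w_hat) * w_hat, swept over the same alpha range.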
3.4. Why Direct Labels, Not Entropy?

A methodological insight: probes trained on semantic entropy failed to separate incorrect from correct samples. This is because uncertainty ≠ incorrectness: models can be confidently wrong (low entropy, hallucinating) or appropriately uncertain (high entropy on ambiguous questions). Direct labels capture what we detect; entropy proxies do not.

4. Experiments

4.1. Datasets

Primary benchmark. TruthfulQA (Lin et al., 2022) provides 817 questions designed to elicit false beliefs and misconceptions. Each question has paired correct/incorrect answers, enabling contrastive probe training. We use 80/20 stratified splits with GroupKFold cross-validation (5 folds, grouped by question) to prevent train-test leakage from paraphrased answers.

Transfer evaluation. We evaluate cross-domain generalization on four additional datasets: SciQ (Welbl et al., 2017) (science QA, 1000 samples), CommonsenseQA (Talmor et al., 2019) (commonsense reasoning, 1221 samples), HaluEval (Li et al., 2023a) (hallucination detection, 10000 samples), and FEVER (Thorne et al., 2018) (fact verification, 10000 samples). Probes are trained on TruthfulQA and evaluated zero-shot on transfer datasets.

4.2. Models

We evaluate 9 models across 5 architecture families spanning 124M to 7B parameters:

Table 1. Dimensionality of the confidence manifold (GroupKFold AUC). Performance peaks at 3-8 dimensions and decreases at higher dimensions. Base models (GPT-2) peak at 3D; instruction-tuned models peak at 4-8D.

Model        1D          2D          3D          4D          5D          8D          16D         32D         Peak
Instruction-tuned
Llama-1B     0.847±.015  0.886±.015  0.892±.013  0.895±.017  0.893±.015  0.878±.025  0.831±.038  0.813±.041  4D
Qwen2-1.5B   0.814±.021  0.848±.017  0.858±.018  0.863±.020  0.864±.026  0.873±.031  0.842±.033  0.810±.041  8D
Gemma-2B     0.817±.030  0.846±.024  0.861±.027  0.860±.029  0.856±.036  0.830±.047  0.780±.052  0.751±.053  3D
Llama-3B     0.887±.018  0.914±.016  0.917±.023  0.917±.020  0.919±.028  0.910±.032  0.893±.039  0.881±.038  5D
Qwen2-7B     0.828±.011  0.882±.009  0.892±.010  0.894±.011  0.896±.011  0.890±.015  0.876±.021  0.843±.030  5D
Mistral-7B   0.860±.019  0.888±.014  0.899±.014  0.902±.016  0.902±.017  0.899±.019  0.876±.023  0.849±.030  5D
Base models (GPT-2)
GPT-2        0.718±.003  0.750±.019  0.758±.016  0.753±.019  0.746±.023  0.720±.028  0.672±.048  0.621±.034  3D
GPT-2-Med    0.747±.009  0.781±.011  0.790±.011  0.784±.012  0.778±.014  0.763±.025  0.723±.031  0.676±.036  3D
GPT-2-Large  0.750±.010  0.784±.015  0.791±.011  0.783±.016  0.769±.013  0.741±.018  0.700±.022  0.673±.020  3D

Base models. GPT-2 family (Radford et al., 2019): GPT-2 (124M, 12 layers), GPT-2-Medium (355M, 24 layers), GPT-2-Large (774M, 36 layers). These autoregressive models provide a controlled comparison across scales within a single architecture.

Instruction-tuned models. Qwen2 (Yang et al., 2024) (1.5B and 7B), Mistral-7B-Instruct (Jiang et al., 2023), Llama-3.2 (Grattafiori et al., 2024) (1B and 3B), and Gemma-2-2B-it (Team et al., 2024). These models enable comparison between base and instruction-tuned representations.

4.3. Baselines

Output-based methods. (1) P(True): probability assigned to "Yes" when asked if the answer is correct; (2) NLL: negative log-likelihood of the answer tokens; (3) Token entropy: $H(p) = -\sum_i p_i \log p_i$ over the next-token distribution; (4) Verbalized confidence: the model's self-reported confidence on a 1-10 scale; (5) Semantic entropy (Farquhar et al., 2024): entropy over semantically clustered generations (5 samples, NLI clustering).

Unsupervised methods. (1) CCS (Burns et al., 2023): contrastive consistency search for truth directions; (2) L2 norm: activation magnitude; (3) Reconstruction error: autoencoder residual; (4) LOF score: local outlier factor for anomaly detection; (5) Cluster uncertainty: distance to the nearest cluster centroid.

4.4. Probing Protocol

Feature extraction. We extract residual stream activations at the last token position (Gurnee et al., 2023; Belinkov, 2022) across all layers. Hidden states are extracted using standard forward hooks.

Classifier. Logistic regression with $L_2$ regularization (C = 0.1) trained on last-token activations. We evaluate with GroupKFold cross-validation (5 folds, grouped by question ID) to ensure no question appears in both train and test sets. All preprocessing (standardization, PLS projection) is fit on training folds only and applied to held-out test folds, preventing any leakage.

Nested CV validation. To verify hyperparameter selection (layer, PLS dimension) does not inflate test AUC, we perform nested cross-validation: an outer loop (5-fold) for final evaluation, an inner loop (3-fold) for hyperparameter selection. Comparing nested vs. standard CV shows negligible bias: Qwen2-7B +0.005, GPT-2-Large -0.026 (Appendix G). This confirms our reported AUCs are unbiased estimates.

Confound controls. Length-balanced evaluation (matched answer lengths), length-residualized probing (regressing out length), and correlation analysis with surface features (L2 norm, mean activation, sparsity).
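A sketch of this probing protocol: GroupKFold by question ID, with standardization and the PLS projection fit on the training fold only. The data is synthetic; the 5-component PLS and C = 0.1 follow the text, everything else is illustrative:

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 256))            # stand-in activations
y = rng.integers(0, 2, size=400)           # stand-in correctness labels
groups = np.repeat(np.arange(200), 2)      # question IDs: paired answers share a group

aucs = []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups):
    scaler = StandardScaler().fit(X[tr])                 # fit on the training fold only
    pls = PLSRegression(n_components=5).fit(scaler.transform(X[tr]), y[tr])
    clf = LogisticRegression(C=0.1, max_iter=1000).fit(
        pls.transform(scaler.transform(X[tr])), y[tr])
    scores = clf.predict_proba(pls.transform(scaler.transform(X[te])))[:, 1]
    aucs.append(roc_auc_score(y[te], scores))
print(f"AUC = {np.mean(aucs):.3f} ± {np.std(aucs):.3f}")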
5. Results

5.1. The Confidence Signal is 3-8 Dimensional

Our central finding: the discriminative confidence signal occupies a low-dimensional subspace (Table 1). We use PLS (partial least squares) to estimate the optimal discriminative subspace; this measures how many dimensions are needed for classification, not the intrinsic geometry of the full representation. All 9 models peak at 3-8 dimensions, with performance decreasing at higher dimensions. GPT-2 drops from 0.76 (3D) to 0.62 (32D), an 18% reduction. Instruction-tuned models peak at 4-8D; base models peak at 3D. Adding more dimensions hurts, not helps: the confidence signal concentrates in a small subspace, and additional dimensions introduce noise.

Why low-dimensional? The 3-8D peak suggests correctness detection relies on a small number of independent features. This is consistent with prior work identifying discrete confidence-related mechanisms: retrieval success, response coherence, and factual consistency (Stolfo et al., 2024). The variation across models (3D for GPT-2, 4-8D for larger models) may reflect differences in how these features are encoded, though confirming this requires mechanistic analysis beyond our scope.

5.2. Linear Separability in the Discriminative Subspace

Given this low-dimensional structure, can geometric classifiers exploit non-linear patterns within the PLS subspace? We test convex hull classification (Yang et al., 2004), Mahalanobis distance, k-nearest neighbors, and kernel SVM in 8D PLS space (Table 2). Across all 9 models, no geometric method improves over linear probes. This is informative: within the dominant discriminative subspace, the confidence signal is linearly separable. Complex decision boundaries provide no benefit because the optimal boundary is simply a hyperplane between class centroids.

5.3. Centroid Distance Enables Minimal-Supervision Detection

Centroid distance matches discriminative probing. Centroid distance in PLS space matches linear probe performance. This generative approach achieves parity with discriminative learning:

$$\mathrm{conf}(x) = \frac{\exp(-\|x - \mu_{\mathrm{correct}}\|)}{\exp(-\|x - \mu_{\mathrm{correct}}\|) + \exp(-\|x - \mu_{\mathrm{incorrect}}\|)}$$

The theoretical implication: class separation is dominated by a mean shift between correct and incorrect distributions, not by covariance differences. This explains why linear methods suffice: the optimal decision boundary is perpendicular to the line connecting centroids. However, some labels are necessary: the best unsupervised feature (L2 norm, Table 3) achieves only 0.62 AUC on Mistral-7B, far below supervised methods (0.92).

Few-shot label efficiency. How many labels are needed? We evaluate label budgets on GPT-2 (Appendix E.4). With just N = 5 examples per class, centroid achieves 0.60 AUC; at N = 25, 0.69 AUC (89% of full-data); at N = 100, 0.76 AUC. Centroid matches or exceeds the probe at all budgets. Given a learned subspace, detection requires only two mean vectors.
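A sketch of the centroid detector: in the learned PLS subspace, confidence is the formula above, a softmax over negative distances to the two class means. Names are illustrative; the stable sigmoid form is an algebraically equivalent rewriting:

import numpy as np

def centroid_confidence(Z_train, y_train, Z_test):
    # conf(x) = exp(-||x - mu_correct||) / (exp(-||x - mu_correct||) + exp(-||x - mu_incorrect||))
    mu_correct = Z_train[y_train == 1].mean(axis=0)
    mu_incorrect = Z_train[y_train == 0].mean(axis=0)
    d_c = np.linalg.norm(Z_test - mu_correct, axis=1)
    d_i = np.linalg.norm(Z_test - mu_incorrect, axis=1)
    return 1.0 / (1.0 + np.exp(d_c - d_i))   # equivalent, numerically stable form

# Few-shot use: the two means can be estimated from as few as 5-25 labeled examples per class.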
5.4. Model Comparison

Table 4 presents the full model comparison. Key findings: (1) Instruction-tuned models achieve higher AUC (0.91-0.97) vs the GPT-2 family (0.80-0.84). (2) Optimal depth varies: GPT-2 peaks at final layers (100%), instruction-tuned models at mid-layers (43-75%). (3) Optimal dimensionality is 3-8D across all models.

Universal structure across architectures. The 3-8D optimal dimensionality emerges consistently across all 9 models despite 10× parameter variation (124M-7B). Intrinsic dimension compresses from 20-55D at early layers to 8-12D at optimal layers, a 40-60% reduction. Early layers show high cross-model variance (std 0.28); late layers converge (std 0.11). This architecture-agnostic convergence indicates a shared computational structure for confidence encoding. However, lower dimension correlates with higher AUC (r = -0.43) but explains only 18% of variance ($R^2 = 0.18$). Subspace orientation matters more, explaining why PLS (supervised) outperforms unsupervised reduction (Appendix B.1).

Confounds. A length-only probe achieves 0.54 AUC (r = -0.016, p = 0.52); length balancing reduces AUC by only 1.3%. Correlations with embedding statistics (L2 norm, mean activation, sparsity) are all low (|r| < 0.15), confirming probes detect semantic content, not surface features.

Paraphrase control. To verify probes detect correctness rather than answer style, we test with paraphrased answers. For each correct/incorrect answer, we create 5 paraphrase variants and analyze variance on 817 TruthfulQA questions. The F-ratio of 17.40 (between/within-answer variance) confirms correctness drives separation 17× more strongly than paraphrase style, with GroupKFold test AUC of 0.926 on unseen paraphrased questions (Appendix H).

5.5. Generalization and Limitations

Cross-dataset transfer. Probes trained on TruthfulQA generalize to other factual QA domains (Table 5). Instruction-tuned models show robust transfer: Qwen2-7B achieves 0.69 cross-domain AUC (vs 0.90 in-domain). The GPT-2 family shows weaker transfer (0.51-0.56). We exclude HaluEval from the average because it tests summarization faithfulness rather than factual correctness; transfer to HaluEval is below chance (0.18-0.47), indicating our signal is specific to factual QA.

Cross-architecture transfer. Within GPT-2: small-to-large transfer retains 92% of the signal, while large-to-small retains only 73%. Cross-architecture transfer (GPT-2 to Qwen2-7B) shows 54-58% retention, suggesting larger models encode confidence in ways smaller models cannot represent.

5.6. Why Internal Representations, Not Outputs?

A natural question: why probe internal representations over output-based uncertainty measures? We systematically compare against output-based methods (Table 6). Output-based methods (P(True), token entropy, semantic entropy (Farquhar et al., 2024), and CCS (Burns et al., 2023)) achieve 0.44-0.64 AUC (median 0.51, near chance). Internal probes achieve 0.80-0.94 AUC. This confirms that confidence information exists in internal representations but is not accessible from model outputs. Uncertainty ≠ correctness: semantic entropy achieves only 0.43-0.60 AUC because TruthfulQA contains confidently wrong answers: models assert misconceptions with low uncertainty. SE measures what the model is uncertain about; probes detect what the model is wrong about, distinct signals.
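For reference, a sketch of two of the output-based baselines compared in Table 6, P(True) and token entropy; the self-check prompt template is an illustrative assumption, not the paper's exact wording:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

@torch.no_grad()
def next_token_logprobs(prompt):
    ids = tok(prompt, return_tensors="pt")
    return F.log_softmax(model(**ids).logits[0, -1], dim=-1)

def p_true(question, answer):
    # P(True): probability of " Yes" as the next token after a self-check prompt
    logp = next_token_logprobs(f"Q: {question}\nA: {answer}\nIs this answer correct? Yes or No:")
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    return logp[yes_id].exp().item()

def token_entropy(prompt):
    # H(p) = -sum_i p_i log p_i over the next-token distribution
    logp = next_token_logprobs(prompt)
    return float(-(logp.exp() * logp).sum())

print(p_true("What is the capital of Australia?", "Sydney"))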
Table 2. Geometric classifiers in 8D PLS space (GroupKFold AUC). NCH = Nearest Convex Hull. No method consistently outperforms linear probes; within this discriminative subspace, the signal is linearly separable.

Model        Linear      Centroid    Mahal.      NCH         KNN         SVM
Instruction-tuned
Llama-1B     0.881±.016  0.885±.012  0.880±.015  0.849±.014  0.864±.018  0.857±.025
Qwen2-1.5B   0.859±.045  0.859±.029  0.858±.036  0.808±.038  0.847±.031  0.842±.036
Gemma-2B     0.846±.021  0.851±.022  0.846±.032  0.818±.027  0.817±.024  0.822±.026
Llama-3B     0.922±.019  0.919±.017  0.918±.023  0.900±.019  0.903±.022  0.906±.029
Qwen2-7B     0.885±.019  0.891±.017  0.887±.020  0.849±.014  0.875±.012  0.878±.015
Mistral-7B   0.890±.023  0.897±.020  0.893±.022  0.851±.029  0.883±.021  0.880±.028
Base models (GPT-2)
GPT-2        0.757±.038  0.766±.029  0.762±.036  0.714±.045  0.752±.029  0.743±.037
GPT-2-Med    0.787±.023  0.799±.020  0.796±.020  0.746±.033  0.787±.023  0.779±.020
GPT-2-Large  0.777±.017  0.797±.016  0.791±.017  0.754±.028  0.779±.017  0.773±.015

Table 3. Unsupervised features (AUC, no labels). Best is L2 norm, but far below supervised.

Feature       GPT-2        Mistral-7B   Llama-1B
L2 norm       0.508±0.003  0.621±0.008  0.608±0.026
Recon. error  0.531±0.025  0.556±0.008  0.546±0.010
LOF score     0.520±0.017  0.572±0.006  0.556±0.034
Cluster unc.  0.523±0.016  0.534±0.014  0.557±0.029
Local dim     0.547±0.038  0.578±0.045  0.590±0.055
Best superv.  0.757±0.038  0.890±0.023  0.881±0.016

Table 4. Correctness detection across 9 models (AUC). Layer = optimal layer / total layers.

Model        Size   Layer   Depth  AUC
Instruction-tuned
Llama-1B     1B     L8/16   56%    0.93±.02
Qwen2-1.5B   1.5B   L16/28  61%    0.91±.03
Gemma-2B     2B     L15/26  62%    0.93±.01
Llama-3B     3B     L12/28  43%    0.97±.01
Qwen2-7B     7B     L20/28  75%    0.94±.02
Mistral-7B   7B     L23/32  75%    0.92±.02
Base models (GPT-2)
GPT-2        124M   L11/12  100%   0.80±.04
GPT-2-Med    355M   L23/24  100%   0.84±.02
GPT-2-Large  774M   L35/36  100%   0.84±.02

Table 5. Cross-dataset generalization (AUC). In-Dom = TruthfulQA, Cross = avg of SciQ, CSQA, FEVER (factual QA tasks). HaluEval excluded: it tests summarization faithfulness, a different task.

Model        In-Dom      Cross     Gap
Instruction-tuned
Llama-1B     0.924±.001  0.63±.01  0.29
Qwen2-1.5B   0.877±.001  0.65±.01  0.23
Qwen2-7B     0.901±.002  0.69±.01  0.21
Base models (GPT-2)
GPT-2        0.732±.001  0.51±.01  0.22
GPT-2-Med    0.750±.001  0.56±.01  0.19
GPT-2-Large  0.751±.001  0.56±.01  0.19

Table 6. Internal probes vs output-based methods (GroupKFold AUC). Output-based methods achieve near-chance performance while internal probes achieve 0.80-0.94 AUC.

Model        P(T)  Ent.  SE    CCS   Probe     Δ
GPT-2        0.44  0.51  0.56  0.54  0.80±.04  +43%
GPT-2-Med    0.45  0.51  0.60  0.61  0.84±.02  +38%
Mistral-7B   0.53  0.51  0.55  –     0.92±.02  +67%
Qwen2-7B     0.64  0.52  0.58  –     0.94±.02  +47%

5.7. Causal Validation via Activation Steering

To verify that the learned probe direction is causally relevant rather than merely correlational, we perform activation steering experiments. Intervening along the confidence direction should systematically alter error rates; intervening along control directions should have no effect.

Protocol. We modify activations at the optimal layer during generation: $h' = h + \alpha \cdot \hat{w}$, where $\hat{w}$ is the L2-normalized probe weight vector. The steering vector is added at every token position during generation, following Li et al. (2023b). We sweep $\alpha \in [-5, 5]$ and measure error rate on held-out TruthfulQA questions, judged against ground-truth labels (Appendix F).
Results. The learned direction produces a monotonic, symmetric effect (Figure 2). Steering toward uncertainty (negative α) increases error rate from baseline 0.56 to 0.63 at α = -5, while steering toward confidence (positive α) decreases it to 0.52 at α = +5, yielding a total effect size of 10.9 percentage points.

[Figure 2. Steering intervention analysis. Error rate on held-out TruthfulQA questions vs. steering coefficient α ∈ [-5, 5]. Interventions modify the forward pass at the optimal layer: h' = h + α·ŵ. The learned confidence direction produces a monotonic 10.9 percentage point swing: α = -5 increases error rate to 0.63 (steering toward uncertainty), α = +5 decreases it to 0.52 (steering toward confidence). Random directions (r ~ N(0, I)) and orthogonal directions (r⊥) show no systematic effect, remaining at baseline 0.56.]

Control directions show minimal effect. Random directions ($r \sim \mathcal{N}(0, I)$, normalized) produce a mean effect of +1.8 pp (p = 0.59). Orthogonal directions ($r_\perp = r - (r^\top \hat{w})\hat{w}$) produce a mean effect of +1.6 pp (p = 0.64). Both are indistinguishable from zero, confirming the specificity of the learned direction.

Interpretation. The contrast between learned and control directions establishes that we have identified a causally relevant representation, not merely a statistical correlate. The 10.9 pp effect size is practically significant for applications requiring calibrated confidence.

Cross-dataset validation. Table 8 shows probes trained on TruthfulQA transfer to independent datasets: FEVER (0.68-0.76 AUC), SciQ (0.61-0.68 AUC), CSQA (0.57-0.68 AUC), all significantly exceeding the random baseline (0.50). This confirms the confidence direction captures general correctness signals rather than dataset-specific artifacts.

6. Discussion

The Confidence Manifold is Simpler Than Expected. The discriminative signal for correctness concentrates in 3-8 dimensions (Figure 3), substantially lower than the high-dimensional activation space. While the embedding manifold may span 8-12D (as measured by MLE), the classification-relevant structure is simpler. This simplifies the linear representation hypothesis (Park et al., 2024): confidence is not just linear, but low-dimensional linear.

Three-Phase Processing Structure. Cross-layer similarity analysis reveals block-diagonal structure with consistent phase boundaries: Phase I (0-30% depth) performs token-level feature extraction with low similarity to later layers; Phase II (30-70%) integrates semantics with gradual probe weight rotation; Phase III (70-100%) encodes stable confidence with high intra-phase coherence (0.81 mean similarity). This architecture-agnostic pattern suggests a universal processing pipeline where confidence crystallizes in middle-to-late layers.

Why Geometric Complexity Provides No Benefit. One might expect geometric classifiers (convex hulls, Mahalanobis distance, kernel methods) to exploit non-linear structure. They do not. The explanation: the confidence signal is linearly separable. The optimal decision boundary is a hyperplane between class centroids; complex boundaries model structure that does not exist. Practitioners need not pursue complex classifiers; a simple centroid-based detector suffices. Anthropic's Constitutional Classifier++ validates this in production: linear probes on internal activations reduce jailbreak success from 86% to 4.4% with 1% overhead (Sharma et al., 2025). Centroid-based approaches may simplify further by eliminating discriminative training.

Centroid Distance Matches Discriminative Learning. Centroid distance in PLS space matches linear probe performance (0.90 vs 0.89 AUC on Mistral-7B). This is theoretically informative: a generative approach (modeling class distributions via means) achieves parity with discriminative learning (Ng & Jordan, 2001). Their equivalence indicates well-separated Gaussian-like clusters. Practically, confidence estimation requires only two mean vectors, not a trained classifier. Geometrically: in discriminative coordinates, classes form approximately Gaussian clusters with similar covariance; the optimal decision boundary is the perpendicular bisector of the centroids.
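This equivalence follows from a standard Gaussian-discriminant result (not derived in the paper); a short derivation under the shared-covariance assumption the text invokes:

If $h \mid y \sim \mathcal{N}(\mu_y, \Sigma)$ with shared covariance $\Sigma$ and class priors $\pi_y$, Bayes' rule gives a log-odds that is linear in $h$:

$$\log \frac{p(y{=}1 \mid h)}{p(y{=}0 \mid h)} = (\mu_1 - \mu_0)^\top \Sigma^{-1} h - \tfrac{1}{2}\big(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_0^\top \Sigma^{-1} \mu_0\big) + \log \frac{\pi_1}{\pi_0}$$

With $\Sigma = \sigma^2 I$ and equal priors, this reduces to comparing $\|h - \mu_1\|$ against $\|h - \mu_0\|$: the decision boundary is exactly the perpendicular bisector of the segment joining the centroids, so the centroid rule and a linear probe agree.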
[Figure 3. 3D PLS visualization of the confidence manifold. Row 1: instruction-tuned models (Qwen2-7B AUC=0.94, Mistral-7B AUC=0.92, Llama-3B AUC=0.97). Row 2: GPT-2 family (GPT-2 AUC=0.80, GPT-2-Med AUC=0.84, GPT-2-Large AUC=0.84). Convex hulls show class regions; stars mark centroids. The GPT-2 family shows clearer visual separation despite lower AUC (0.80-0.84), while instruction-tuned models achieve higher AUC (0.91-0.97) with more overlap in the 3D projection. See Appendix E.1 for smaller instruction-tuned models.]

Internal vs Output: A Fundamental Gap. Output-based methods achieve near-chance performance (0.44-0.64 AUC) while internal probes achieve 0.80-0.97 AUC. The information required for correctness detection exists internally but is not expressed in outputs. This extends findings that LLMs "know more than they show" (Orgad et al., 2025) and confirms that uncertainty ≠ correctness (Kuhn et al., 2023; Farquhar et al., 2024): models assert misconceptions.

Cross-Domain Transfer. Full-dimensional probes show weak cross-dataset generalization on reasoning tasks (0.47-0.50 AUC on SciQ/CSQA), memorizing dataset-specific patterns. On Qwen2-7B, projecting to 5D PLS dimensions improves cross-domain transfer by 10-14% absolute AUC (0.61-0.79). Transfer across datasets with different answer formats rules out stylistic artifacts as the primary signal. PLS extracts the subspace maximally correlated with correctness, discarding dataset-specific variance.

Future Directions. Can online adaptation track generation-time shifts? Characterizing these dynamics would extend our geometric framework to real-time monitoring. Additionally, while TruthfulQA's paired format controls for topic confounds, extending to naturalistic errors (long-form generation, RAG failures) would broaden applicability.

7. Conclusion

Across 9 models from 5 families, correctness representations exhibit consistent geometric properties.
The discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance matches probe performance, enabling detection with two mean vectors rather than discriminative training. The correctness signal exists internally (0.80-0.97 AUC) but is not expressed in outputs (0.44-0.64 AUC). Activation steering along the learned direction produces 10.9 percentage point changes in error rates, confirming causal relevance. On Qwen2-7B, PLS dimension reduction improves cross-domain transfer by 10-14% absolute AUC, suggesting the confidence signal generalizes when dataset-specific variance is removed.

References

Arad, D., Mueller, A., and Belinkov, Y. SAEs are good for steering – if you select the right features. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 10241-10259, Suzhou, China, 2025. doi: 10.18653/v1/2025.emnlp-main.519. https://aclanthology.org/2025.emnlp-main.519/

Azaria, A. and Mitchell, T. The internal state of an LLM knows when it's lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 967-976, Singapore, 2023. doi: 10.18653/v1/2023.findings-emnlp.68. https://aclanthology.org/2023.findings-emnlp.68/

Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207-219, 2022. doi: 10.1162/coli_a_00422. https://aclanthology.org/2022.cl-1.7/

Bricken, T., Templeton, A., Batson, J., et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html

Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations (ICLR), 2023. https://openreview.net/forum?id=ETKGuby0hcs

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models, 2023. https://arxiv.org/abs/2309.08600

Engels, J., Michaud, E. J., Liao, I., Gurnee, W., and Tegmark, M. Not all language model features are one-dimensionally linear. In The Thirteenth International Conference on Learning Representations (ICLR), 2025. https://openreview.net/forum?id=d63a4AM4hb

Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625-630, 2024. doi: 10.1038/s41586-024-07421-0
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., et al. The Llama 3 herd of models, 2024. https://arxiv.org/abs/2407.21783

Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D.
Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023. https://openreview.net/forum?id=JYs1R9IMJr

Han, J., Kossen, J., Razzak, M., Schut, L., Malik, S. A., and Gal, Y. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. In ICML 2024 Workshop on Foundation Models in the Wild, 2024. https://openreview.net/forum?id=Zd0XLr6JKn

Ji, J., Wang, K., Qiu, T. A., Chen, B., Zhou, J., Li, C., Lou, H., Dai, J., Liu, Y., and Yang, Y. Language models resist alignment: Evidence from data compression. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 23411-23432, Vienna, Austria, 2025. doi: 10.18653/v1/2025.acl-long.1141. https://aclanthology.org/2025.acl-long.1141/

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B, 2023. https://arxiv.org/abs/2310.06825

Joshi, A., Ahmad, A., and Modi, A. Calibration across layers: Understanding calibration evolution in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 14686-14714, Suzhou, China, 2025. doi: 10.18653/v1/2025.emnlp-main.742. https://aclanthology.org/2025.emnlp-main.742/

Kissane, C., Krzyzanowski, R., Conmy, A., and Nanda, N. SAEs (usually) transfer between base and chat models. Alignment Forum, 2024. https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models

Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations (ICLR), 2023. https://openreview.net/forum?id=VD-AYtP0dve

Levina, E. and Bickel, P. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, volume 17. MIT Press, 2004. https://proceedings.neurips.cc/paper_files/paper/2004/file/74934548253bcab8490ebd74afed7031-Paper.pdf

Li, J., Cheng, X., Zhao, X., Nie, J.-Y., and Wen, J.-R. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6449-6464, Singapore, 2023a. doi: 10.18653/v1/2023.emnlp-main.397. https://aclanthology.org/2023.emnlp-main.397/

Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023b. https://openreview.net/forum?id=aLLuYpn83y

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramar, J., Dragan, A., Shah, R., and Nanda, N. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2.
In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 278-300, Miami, Florida, 2024. doi: 10.18653/v1/2024.blackboxnlp-1.19. https://aclanthology.org/2024.blackboxnlp-1.19/

Lin, S., Hilton, J., and Evans, O. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214-3252, Dublin, Ireland, 2022. doi: 10.18653/v1/2022.acl-long.229. https://aclanthology.org/2022.acl-long.229/

Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In First Conference on Language Modeling, 2024. https://openreview.net/forum?id=aajyHYjjsk

Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations (ICLR), 2025. https://openreview.net/forum?id=I4e82CIDxv

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR), 2013. https://openreview.net/forum?id=idpCdOWtqXd60

Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12076-12100, Singapore, 2023. doi: 10.18653/v1/2023.emnlp-main.741. https://aclanthology.org/2023.emnlp-main.741/

Nanda, N., Lee, A., and Wattenberg, M. Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 16-30, Singapore, 2023. doi: 10.18653/v1/2023.blackboxnlp-1.2. https://aclanthology.org/2023.blackboxnlp-1.2/

Ng, A. and Jordan, M. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001. https://proceedings.neurips.cc/paper_files/paper/2001/file/7b7a53e239400a13bd6be6c91c4f6c4e-Paper.pdf

Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpektor, I., Kotek, H., and Belinkov, Y. LLMs know more than they show: On the intrinsic representation of LLM hallucinations. In The Thirteenth International Conference on Learning Representations (ICLR), 2025. https://openreview.net/forum?id=KRnsX5Em3W

Park, K., Choe, Y. J., Jiang, Y., and Veitch, V. The geometry of categorical and hierarchical concepts in large language models. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. https://openreview.net/forum?id=KXuYjuBzKo
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Rawte, V., Chakraborty, S., Pathak, A., Sarkar, A., Tonmoy, S. T. I., Chadha, A., Sheth, A., and Das, A. The troubling emergence of hallucination in large language models – an extensive definition, quantification, and prescriptive remediations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2541-2573, Singapore, 2023. doi: 10.18653/v1/2023.emnlp-main.155. https://aclanthology.org/2023.emnlp-main.155/

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15504-15522, Bangkok, Thailand, 2024. doi: 10.18653/v1/2024.acl-long.828. https://aclanthology.org/2024.acl-long.828/

Sharma, M., Tong, M., Mu, J., Wei, J., Kruthoff, J., et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming, 2025. https://arxiv.org/abs/2501.18837

Sriramanan, G., Bharti, S., Sadasivan, V. S., Saha, S., Kattakinda, P., and Feizi, S. LLM-Check: Investigating detection of hallucinations in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024. https://openreview.net/forum?id=LYx4w3CAgy

Stolfo, A., Wu, B., Gurnee, W., Belinkov, Y., Song, X., Sachan, M., and Nanda, N. Confidence regulation neurons in language models. In Advances in Neural Information Processing Systems, volume 37, pp. 125019-125049. Curran Associates, Inc., 2024. https://proceedings.neurips.cc/paper_files/paper/2024/file/e21955c93dede886af1d0d362c756757-Paper-Conference.pdf

Su, W., Wang, C., Ai, Q., Hu, Y., Wu, Z., Zhou, Y., and Liu, Y. Unsupervised real-time hallucination detection based on the internal states of large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 14379-14391, Bangkok, Thailand, 2024. doi: 10.18653/v1/2024.findings-acl.854. https://aclanthology.org/2024.findings-acl.854/

Sun, X., Stolfo, A., Engels, J., Wu, B. P., Rajamanoharan, S., Sachan, M., and Tegmark, M. Dense SAE latents are features, not bugs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. https://openreview.net/forum?id=p8lKcNkJRi
Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421/.

Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsitsulin, A., Vieillard, N., Stanczyk, P., Girgin, S., Momchev, N., Hoffman, M., Thakoor, S., Grill, J.-B., Neyshabur, B., Bachem, O., Walton, A., Severyn, A., Parrish, A., Ahmad, A., Hutchison, A., Abdagic, A., Carl, A., Shen, A., Brock, A., Coenen, A., Laforge, A., Paterson, A., Bastian, B., Piot, B., Wu, B., Royal, B., Chen, C., Kumar, C., Perry, C., Welty, C., Choquette-Choo, C. A., Sinopalnikov, D., Weinberger, D., Vijaykumar, D., Rogozińska, D., Herbison, D., Bandy, E., Wang, E., Noland, E., Moreira, E., Senter, E., Eltyshev, E., Visin, F., Rasskin, G., Wei, G., Cameron, G., Martins, G., Hashemi, H., Klimczak-Plucińska, H., Batra, H., Dhand, H., Nardini, I., Mein, J., Zhou, J., Svensson, J., Stanway, J., Chan, J., Zhou, J. P., Carrasqueira, J., Iljazi, J., Becker, J., Fernandez, J., van Amersfoort, J., Gordon, J., Lipschultz, J., Newlan, J., yeong Ji, J., Mohamed, K., Badola, K., Black, K., Millican, K., McDonell, K., Nguyen, K., Sodhia, K., Greene, K., Sjoesund, L. L., Usui, L., Sifre, L., Heuermann, L., Lago, L., McNealus, L., Soares, L. B., Kilpatrick, L., Dixon, L., Martins, L., Reid, M., Singh, M., Iverson, M., Görner, M., Velloso, M., Wirth, M., Davidow, M., Miller, M., Rahtz, M., Watson, M., Risdal, M., Kazemi, M., Moynihan, M., Zhang, M., Kahng, M., Park, M., Rahman, M., Khatwani, M., Dao, N., Bardoliwalla, N., Devanathan, N., Dumai, N., Chauhan, N., Wahltinez, O., Botarda, P., Barnes, P., Barham, P., Michel, P., Jin, P., Georgiev, P., Culliton, P., Kuppala, P., Comanescu, R., Merhej, R., Jana, R., Rokni, R. A., Agarwal, R., Mullins, R., Saadat, S., Carthy, S. M., Cogan, S., Perrin, S., Arnold, S. M. R., Krause, S., Dai, S., Garg, S., Sheth, S., Ronstrom, S., Chan, S., Jordan, T., Yu, T., Eccles, T., Hennigan, T., Kocisky, T., Doshi, T., Jain, V., Yadav, V., Meshram, V., Dharmadhikari, V., Barkley, W., Wei, W., Ye, W., Han, W., Kwon, W., Xu, X., Shen, Z., Gong, Z., Wei, Z., Cotruta, V., Kirk, P., Rao, A., Giang, M., Peran, L., Warkentin, T., Collins, E., Barral, J., Ghahramani, Z., Hadsell, R., Sculley, D., Banks, J., Dragan, A., Petrov, S., Vinyals, O., Dean, J., Hassabis, D., Kavukcuoglu, K., Farabet, C., Buchatskaya, E., Borgeaud, S., Fiedel, N., Joulin, A., Kenealy, K., Dadashi, R., and Andreev, A. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118.

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
Thorne, J., Vlachos, A., Christodoulopoulos, C., and Mittal, A. FEVER: a large-scale dataset for fact extraction and VERification. In Walker, M., Ji, H., and Stent, A. (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 809–819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https://aclanthology.org/N18-1074/.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248.

Wang, T., Jiao, X., He, Y., Chen, Z., Zhu, Y., Chu, X., Gao, J., Liu, Y., et al. Adaptive activation steering: A tuning-free LLM truthfulness improvement method for diverse hallucinations categories. In Proceedings of the ACM Web Conference 2025, 2025. URL https://arxiv.org/abs/2406.00034.

Welbl, J., Liu, N. F., and Gardner, M. Crowdsourcing multiple choice science questions. In Derczynski, L., Xu, W., Ritter, A., and Baldwin, T. (eds.), Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4413. URL https://aclanthology.org/W17-4413/.

Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., Dong, G., Wei, H., Lin, H., Tang, J., Wang, J., Yang, J., Tu, J., Zhang, J., Ma, J., Yang, J., Xu, J., Zhou, J., Bai, J., He, J., Lin, J., Dang, K., Lu, K., Chen, K., Yang, K., Li, M., Xue, M., Ni, N., Zhang, P., Wang, P., Peng, R., Men, R., Gao, R., Lin, R., Wang, S., Bai, S., Tan, S., Zhu, T., Li, T., Liu, T., Ge, W., Deng, X., Zhou, X., Ren, X., Zhang, X., Wei, X., Ren, X., Liu, X., Fan, Y., Yao, Y., Zhang, Y., Wan, Y., Chu, Y., Liu, Y., Cui, Z., Zhang, Z., Guo, Z., and Fan, Z. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671.

Yang, M.-H., Kriegman, D., and Ahuja, N. Nearest convex hull classification. Pattern Recognition Letters, 25(5): 637–646, 2004.

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D. Representation engineering: A top-down approach to AI transparency, 2023. URL https://arxiv.org/abs/2310.01405.

A. Reproducibility and Implementation

A.1. Seed Robustness

All experiments use multiple random seeds (42, 123, 456) controlling: (1) stratified train/test splits, (2) GroupKFold assignment by question ID, and (3) stochastic generation for semantic entropy. We report mean ± std where variance is non-trivial.

Variance analysis. Most metrics show low variance: probe AUC std < 0.02 for 8/9 models, in-domain transfer std < 0.01. Higher variance appears in Mistral-7B/Gemma-2B cross-dataset transfer (std ≈ 0.20), potentially reflecting dataset-model mismatch. All key comparisons (probe vs. entropy) achieve p < 0.05 after Bonferroni correction via paired t-tests across seeds.
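As an illustration of this testing procedure, the following sketch applies a paired t-test across seeds with a Bonferroni correction. This is not the paper's analysis code; the AUC values and comparison count here are hypothetical placeholders.

import numpy as np
from scipy import stats

# Hypothetical per-seed AUCs (seeds 42, 123, 456) for one model.
probe_auc = np.array([0.772, 0.775, 0.770])
entropy_auc = np.array([0.581, 0.575, 0.584])

t_stat, p_value = stats.ttest_rel(probe_auc, entropy_auc)  # paired across seeds
n_comparisons = 9  # e.g., one probe-vs-entropy comparison per model
p_corrected = min(p_value * n_comparisons, 1.0)            # Bonferroni correction
print(f"t = {t_stat:.2f}, Bonferroni-corrected p = {p_corrected:.4f}")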
A.2. Implementation Details

Models. GPT-2 (124M, 12L), GPT-2-Medium (355M, 24L), GPT-2-Large (774M, 36L), Qwen2-1.5B-Instruct (28L), Qwen2-7B-Instruct (28L), Llama-3.2-1B-Instruct (16L), Llama-3.2-3B-Instruct (28L), Mistral-7B-Instruct-v0.3 (32L), Gemma-2-2B-it (26L).

Extraction. Residual stream activations at the last-token position, following Gurnee et al. (2023). PLS dimension reduction treats the labels as continuous targets (0/1).

Probing. Logistic regression (C = 0.1); 640–1,280 samples per dataset; 5-fold GroupKFold CV grouped by question ID to prevent data leakage. Layer and PLS dimension selection use CV test AUC (averaged across held-out folds), not train AUC, ensuring no information leakage from hyperparameter selection.

SAE analysis. SAELens with Neuronpedia pretrained GPT-2 models (24,576 features, 32× expansion).

B. Geometric Analysis of the Confidence Manifold

This section provides detailed geometric characterization of how confidence representations evolve through transformer layers. Our analysis reveals three key findings: (1) intrinsic dimension follows a compression pattern, initially expanding in early layers (peaking around 10–20% depth) before decreasing through middle and late layers; (2) probe weight similarity exhibits block-diagonal structure indicating distinct processing phases; and (3) dimension alone explains only 18% of classification variance: the orientation of the low-dimensional subspace matters more than its dimensionality.

B.1. Universal Compression Pattern

Figure 4 synthesizes geometric properties across all 9 models, revealing architecture-agnostic patterns in how confidence is encoded.

Compression dynamics. Intrinsic dimension (estimated via Levina-Bickel MLE (Levina & Bickel, 2004)) compresses from 20–55D at early layers to 8–12D at optimal layers, a 40–60% reduction. This compression is model-agnostic: the GPT-2, Mistral, Qwen, Llama, and Gemma families all converge to similar final dimensionality despite vastly different training procedures and scales. Early layers (0–20% depth) show high cross-model variance (std = 0.28 normalized units); late layers converge (std = 0.11), suggesting that while models initialize representations differently, they converge to similar compressed confidence encodings.

Dimension vs. performance. The negative correlation (r = −0.43) between intrinsic dimension and probe AUC initially suggests that lower-dimensional representations yield better classification. However, the weak R² = 0.18 reveals that low dimension is necessary but not sufficient: the orientation of the low-dimensional subspace relative to the correct/incorrect decision boundary matters more than its raw dimensionality. A 10D subspace aligned with the confidence direction outperforms a 5D subspace misaligned with it. This explains why PLS outperforms unsupervised dimension reduction: PLS finds the 3–5D subspace that maximizes class separation, not merely the directions of highest variance.
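For reference, below is a minimal sketch of the Levina-Bickel MLE used for these intrinsic-dimension estimates, assuming an activation matrix X of shape (n_samples, d_model); the neighborhood size k = 20 is illustrative, not necessarily the paper's setting.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def intrinsic_dimension_mle(X, k=20):
    """Levina-Bickel MLE of intrinsic dimension, averaged over points."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    dist = dist[:, 1:]  # drop the zero self-distance column
    # Per-point estimate: (k-1) / sum_j log(T_k / T_j), with T_j the
    # distance to the j-th nearest neighbor, j = 1..k-1.
    log_ratios = np.log(dist[:, -1:] / dist[:, :-1])
    m_hat = (k - 1) / log_ratios.sum(axis=1)
    return float(m_hat.mean())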
Three-phase processing. The averaged cross-layer similarity matrix (Figure 4c) reveals block-diagonal structure with consistent phase boundaries across architectures:

• Phase I (0–30% depth): Token-level feature extraction; low similarity to later layers (mean cross-phase similarity: 0.18)
• Phase II (30–70% depth): Semantic integration; gradual probe weight rotation (mean within-phase similarity: 0.58)
• Phase III (70–100% depth): Stable confidence encoding; high intra-phase coherence (mean within-phase similarity: 0.81)

Figure 4. Universal geometric patterns across architectures. (a) Normalized intrinsic dimension (MLE) by layer depth. All models compress from early to late layers (mean curve in black), with peak dimension at 10–20% depth. (b) Dimension-performance correlation: lower intrinsic dimension correlates with higher probe AUC (r = −0.43, p < 0.001), but R² = 0.18 indicates dimension explains less than one-fifth of variance; classification utility depends on direction, not dimensionality. (c) Cross-layer probe weight similarity averaged across models shows three-phase block-diagonal structure with phase boundaries at 30% and 70% depth.

B.2. Architecture-Specific Dimension Evolution

While the compression pattern is universal, architecture-specific variations provide insights into how different models encode confidence.

Figure 5. Intrinsic dimension evolution by architecture. (a) Raw MLE estimates show all models compress from 20–55D to 8–12D, except Mistral-7B, which exhibits late-layer expansion (80–100D at 90%+ depth). (b) Normalized dimension enables cross-model comparison: models follow a common compression trajectory until 80% depth, after which Mistral diverges due to unembedding preparation.

Universal compression, model-specific timing. Figure 5 tracks intrinsic dimension through network depth for four representative models. The GPT-2 family shows earlier phase transitions (boundaries at 25%/60%) compared to instruction-tuned models (30%/70%), suggesting instruction tuning prolongs rich intermediate representations before final compression.

Mistral anomaly. Mistral-7B exhibits late-layer dimension expansion, from 25D at L25 to 80+D at L30–32. This expansion correlates with unembedding preparation: Mistral's architecture appears to re-expand representations before projecting to vocabulary space. The expansion does not improve classification; Mistral's optimal layer is L23 (72% depth), before the expansion begins. This confirms that the confidence signal crystallizes in middle layers; late-layer expansion serves output generation, not confidence encoding.

B.3. Confidence Landscape

To understand the interaction between layer selection and dimensionality, we visualize the full parameter space for our best-performing model.

Figure 6. AUC surface over layer and PLS dimension for Mistral-7B. The surface shows probe performance as a function of layer depth (x-axis, 0–32) and PLS dimension (y-axis, 1–120). Color indicates AUC; the red line traces the per-layer maximum. Peak performance (0.90 AUC) occurs at layer 23, dimension 5. The ridge running through 3–8D at all layers demonstrates that optimal dimensionality is stable; layer choice is the critical hyperparameter.
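A surface like Figure 6 can be reproduced with a straightforward sweep. The sketch below assumes a dict acts mapping layer index to last-token activations and fits the PLS reduction inside each fold to avoid label leakage; all names are illustrative rather than the paper's actual code.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def auc_surface(acts, y, groups, dims=(1, 2, 3, 4, 5, 6, 7, 8, 12, 16)):
    """Mean CV AUC for every (layer, PLS dimension) cell."""
    cv = GroupKFold(n_splits=5)
    surface = {}
    for layer, X in acts.items():
        for d in dims:
            aucs = []
            for tr, te in cv.split(X, y, groups):
                # Fit the reduction on the training fold only.
                pls = PLSRegression(n_components=d).fit(X[tr], y[tr])
                probe = LogisticRegression(C=0.1, max_iter=1000)
                probe.fit(pls.transform(X[tr]), y[tr])
                scores = probe.predict_proba(pls.transform(X[te]))[:, 1]
                aucs.append(roc_auc_score(y[te], scores))
            surface[(layer, d)] = float(np.mean(aucs))
    return surface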
Ridge structure. A clear AUC ridge runs through the surface at 3–8 PLS dimensions across all layers. Performance degrades sharply below 2D (insufficient capacity to capture the signal) and above 16D (overfitting to training noise). The ridge is narrower at early layers (optimal: 3–4D) and broader at late layers (optimal: 4–8D), reflecting increased signal-to-noise ratio in later representations.

Layer dominates dimension. Quantifying the relative importance: fixing dimension at 5D, AUC varies from 0.52 (L0) to 0.92 (L23), a 77% relative improvement. Fixing layer at L23, AUC varies from 0.86 (1D) to 0.92 (5D), only a 7% improvement. Layer selection is roughly 11× more impactful than dimension selection. This motivates our recommendation to tune layer first, then dimension, rather than optimizing jointly.

C. Geometric Classification Methods

Given the low-dimensional structure revealed in Section B, we investigate whether geometric classifiers can exploit nonlinear patterns within the PLS subspace. Our negative result (geometric methods do not outperform linear probes) provides evidence that, within this dominant discriminative subspace, the confidence signal is linearly separable.

C.1. Method Comparison

Figure 7. Geometric confidence estimation methods on GPT-2 (8D PLS space). Five approaches: linear probe (0.773 AUC), centroid distance (0.771), local density (0.701), KNN-10 (0.748), and ensemble (0.764). Scatter plots show 2D projections colored by P(Factual); stars indicate class centroids. The near-equivalence of probe and centroid methods confirms confidence is encoded as a mean shift, not a complex boundary. Density estimation fails because classes differ in location, not density.

We evaluate five geometric approaches in 8D PLS space (Figure 7):

Linear probe (0.773 AUC). Standard logistic regression on PLS-reduced activations. Serves as the discriminative baseline.

Centroid distance (0.771 AUC). Generative approach: classify based on distance to the class centroids. Performance nearly matches the discriminative probe, confirming that class means capture the discriminative signal. The decision boundary is perpendicular to the line connecting the centroids, geometrically equivalent to the probe's learned hyperplane.

Local density (0.701 AUC). Estimates confidence via the kernel density ratio p(correct | x) / p(incorrect | x) using Gaussian KDE with Scott's-rule bandwidth selection. Underperforms because the correct and incorrect distributions have similar local densities: they differ in location (centroid position), not shape (density profile).

KNN-10 (0.748 AUC). Classifies by majority vote of the 10 nearest neighbors. Despite local adaptivity, it underperforms linear methods, suggesting the confidence manifold is globally linear rather than locally curved.

Ensemble (0.764 AUC). Averages predictions across methods. No improvement over the best single method, indicating the errors are correlated rather than complementary.

Implications. The equivalence of discriminative (probe) and generative (centroid) approaches reveals that confidence is encoded as a simple mean shift in activation space, not a complex decision boundary. This supports interpretability: a single direction suffices to extract the confidence signal. A minimal implementation of the centroid detector follows.
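The sketch below implements the centroid detector under the assumption that activations have already been projected into the PLS subspace; function and variable names are illustrative.

import numpy as np

def fit_centroids(Z, y):
    # Z: PLS-reduced activations, shape (n, k); y: 1 = correct, 0 = incorrect.
    return Z[y == 1].mean(axis=0), Z[y == 0].mean(axis=0)

def centroid_score(Z, mu_correct, mu_incorrect):
    # Positive score = closer to the correct-class centroid.
    d_cor = np.linalg.norm(Z - mu_correct, axis=1)
    d_inc = np.linalg.norm(Z - mu_incorrect, axis=1)
    return d_inc - d_cor

The zero-score level set of this detector is the hyperplane equidistant from the two centroids, which is why its AUC tracks the trained probe so closely when separation is a mean shift.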
D. Baseline Method Details

D.1. Semantic Entropy Analysis

We implement semantic entropy following Farquhar et al. (2024) to understand why uncertainty-based methods underperform.

Protocol. (1) Generate K = 5 completions per prompt (nucleus sampling, p = 0.9, T = 0.7). (2) Cluster by semantic equivalence via bidirectional NLI (DeBERTa-v3-large): two responses are equivalent if NLI predicts entailment in both directions. (3) Compute the entropy SE = −∑_c p_c log p_c over cluster probabilities (sketched in code below).

Results on TruthfulQA. Mean SE for correct answers: 0.060 ± 0.089; for incorrect answers: 0.115 ± 0.142. Cohen's d = 0.47 (small-to-medium effect). Classification AUC = 0.58, far below probe AUC (0.77–0.92).

Why SE underperforms. TruthfulQA tests common misconceptions: questions where humans frequently give wrong answers. Models inherit these misconceptions and assert them confidently. The 0.055 SE gap confirms incorrect answers are slightly more uncertain on average, but the effect is too weak for reliable detection. Semantic entropy detects uncertainty, not incorrectness: these are distinct signals, and TruthfulQA specifically targets confident errors.
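A minimal sketch of steps (2)–(3), assuming completions have already been sampled and that entails(a, b) wraps a bidirectional NLI model such as DeBERTa-v3-large. The greedy clustering against each cluster's first member is one common approximation, not necessarily the paper's exact procedure.

import numpy as np

def semantic_entropy(completions, entails):
    """Greedy bidirectional-entailment clustering, then entropy over clusters."""
    clusters = []
    for c in completions:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member
            if entails(c, rep) and entails(rep, c):
                cluster.append(c)
                break
        else:
            clusters.append([c])
    p = np.array([len(cl) for cl in clusters], dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())  # SE = -sum_c p_c log p_c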
D.2. SAE Feature Analysis

We analyze whether pretrained Sparse Autoencoders (SAEs) can provide interpretable confidence features. If confidence localized to specific SAE features, this would enable mechanistic interpretation of confidence encoding.

Setup. Neuronpedia gpt2-small-res-jb SAE (layer 6, 24,576 features, 32× expansion). Layer 6 is the optimal detection layer for GPT-2-small (Appendix E.2).

Feature statistics. By activation frequency: sparse (<1%): 13 features; moderate (1–10%): 21,857 features (88.9%); dense (>10%): 2,706 features (11.0%). The heavy tail toward moderate activation suggests most features are contextually specific.

Correctness correlation. Of 24,576 features, only 307 show significant correlation with correctness labels (p < 0.05 after Bonferroni correction, i.e., raw p < 2 × 10⁻⁶). Maximum |r| = 0.238. Top positive features (incorrect-associated): uncertainty markers, hedging language, abstract concepts. Top negative features (correct-associated): named entities, numerical expressions, specific facts.

Limitation. Individual SAE features explain <6% of variance (r² < 0.057), while linear probes explain 35%+. Confidence is distributed across many features, not localized to interpretable atoms. This motivates our probe-based approach over feature-based interpretability.

E. Extended Results

E.1. Small Instruction-Tuned Models

Figure 8. 3D PLS visualization of the confidence manifold for smaller instruction-tuned models (Qwen2-1.5B, Llama-1B, Gemma-2B). These models achieve AUC 0.91–0.93, comparable to larger instruction-tuned models. However, the visual separation appears less distinct due to higher overlap in the projected 3D space, despite strong full-dimensional classification performance.

E.2. Layer-wise Performance Table

Table 7. Layer-wise performance across the GPT-2 family: AUC and intrinsic dimension (Dim) at key depth percentiles. All models show AUC increases concurrent with dimension compression, supporting the hypothesis that compression and confidence encoding co-occur.

         GPT-2          GPT-2-Medium   GPT-2-Large
Depth    AUC    Dim     AUC    Dim     AUC    Dim
0%       0.632  24.3    0.614  32.1    0.598  41.2
25%      0.648  18.7    0.652  24.6    0.661  28.9
50%      0.704  14.2    0.691  18.3    0.722  21.4
75%      0.670  11.5    0.738  12.1    0.768  14.7
100%     0.772   8.9    0.799   8.3    0.812  11.8

E.3. Full Cross-Dataset Results

Table 8 provides complete cross-dataset transfer results across all models and datasets. Key observations:

• In-domain performance: All instruction-tuned models achieve >0.87 AUC on TruthfulQA; the GPT-2 family achieves 0.73–0.78 AUC.
• HaluEval anomaly: Cross-domain transfer to HaluEval is consistently below random (0.18–0.36 AUC). HaluEval tests LLM-generated hallucinations in dialogue/QA contexts, where errors arise from generation failures rather than factual misconceptions. The inverted transfer suggests TruthfulQA's "confident misconception" signal is anti-correlated with HaluEval's "generation failure" signal.
• FEVER transfer: Best cross-domain performance (0.51–0.76 AUC), likely because FEVER's fact-verification task is closest to TruthfulQA's factuality assessment.

Table 8. Complete cross-dataset transfer results. In-domain: train and test on the same dataset (5-fold GroupKFold). Cross-domain: train on TruthfulQA, test on others (zero-shot transfer). Values are AUC ± std across seeds.

In-domain evaluation
Model          TruthfulQA        SciQ            CSQA             HaluEval          FEVER
Qwen2-7B       0.901 ± 0.002     0.953 ± 0.004   0.897 ± 0.016    0.984 ± 0.0001    0.934 ± 0.001
Qwen2-1.5B     0.877 ± 0.001     0.883 ± 0.014   0.819 ± 0.0002   0.986 ± 0.00002   0.898 ± 0.001
Llama-1B       0.924 ± 0.0003    0.830 ± 0.007   0.719 ± 0.036    0.992 ± 0.0001    0.881 ± 0.001
GPT-2-Large    0.751 ± 0.0001    0.641 ± 0.004   0.621 ± 0.022    0.982 ± 0.00      0.863 ± 0.0005
GPT-2-Medium   0.750 ± 0.00005   0.623 ± 0.013   0.657 ± 0.038    0.985 ± 0.00001   0.872 ± 0.0004
GPT-2          0.732 ± 0.0003    0.579 ± 0.013   0.546 ± 0.012    0.977 ± 0.00001   0.837 ± 0.0002

Cross-domain (train TruthfulQA → test others)
Qwen2-7B       –                 0.676 ± 0.004   0.682 ± 0.017    0.237 ± 0.003     0.720 ± 0.027
Qwen2-1.5B     –                 0.607 ± 0.006   0.574 ± 0.010    0.325 ± 0.031     0.759 ± 0.001
Llama-1B       –                 0.611 ± 0.008   0.604 ± 0.005    0.181 ± 0.001     0.679 ± 0.002
GPT-2-Large    –                 0.541 ± 0.015   0.522 ± 0.001    0.472 ± 0.001     0.612 ± 0.001
GPT-2-Medium   –                 0.537 ± 0.014   0.513 ± 0.011    0.215 ± 0.003     0.624 ± 0.001
GPT-2          –                 0.542 ± 0.011   0.480 ± 0.017    0.266 ± 0.004     0.507 ± 0.00003

E.4. Few-Shot Label Efficiency

For each budget N, we fit PLS using only those N samples, then compute centroids in the resulting 5D space. The centroid method matches or exceeds probe performance at all label budgets, demonstrating that geometric detection can be bootstrapped with minimal annotation (a sketch of this protocol follows Table 9).

Table 9. Few-shot label efficiency (GPT-2, AUC). Centroid matches probe at all N.

N      Probe        Centroid     Mahal.
5      0.59 ± .11   0.60 ± .09   0.59 ± .10
25     0.68 ± .06   0.69 ± .05   0.69 ± .06
100    0.75 ± .01   0.76 ± .01   0.75 ± .01
200    0.78 ± .02   0.78 ± .02   0.78 ± .02
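A sketch of the budgeted protocol, with illustrative names; n_components is capped so the PLS fit remains well-posed at very small N.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def few_shot_centroid_detector(X_few, y_few, n_components=5):
    # Fit the PLS projection on only the N labeled examples.
    n_components = min(n_components, len(y_few) - 1)
    pls = PLSRegression(n_components=n_components).fit(X_few, y_few)
    Z = pls.transform(X_few)
    mu_cor = Z[y_few == 1].mean(axis=0)
    mu_inc = Z[y_few == 0].mean(axis=0)
    def score(X_new):
        # Higher score = closer to the correct-class centroid.
        Z_new = pls.transform(X_new)
        return (np.linalg.norm(Z_new - mu_inc, axis=1)
                - np.linalg.norm(Z_new - mu_cor, axis=1))
    return score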
E.5. PLS Improves Cross-Domain Transfer

PLS dimension reduction not only prevents overfitting (Table 1) but also improves cross-domain generalization. We hypothesize that PLS removes dataset-specific noise while preserving the universal confidence signal.

Setup. Train a linear probe on TruthfulQA embeddings; test on SciQ, CommonsenseQA, and FEVER. Compare full-dimensional embeddings (3584D for Qwen2-7B) against a 5D PLS projection. Four runs with different seeds.

Table 10. PLS improves cross-domain transfer. Train on TruthfulQA, test on other datasets. PLS 5D outperforms full-dimensional embeddings by 10–14 points of absolute AUC.

Method         SciQ          CSQA          FEVER
Full (3584D)   0.50 ± 0.01   0.47 ± 0.01   0.69 ± 0.00
PLS 5D         0.64 ± 0.01   0.61 ± 0.02   0.79 ± 0.00
Δ (absolute)   +0.14         +0.14         +0.10

Interpretation. Full-dimensional probes memorize TruthfulQA-specific patterns (near-random transfer: 0.47–0.50 AUC on SciQ/CSQA). PLS extracts the 5D subspace maximally correlated with correctness labels, discarding dataset-specific variance. This 5D signal transfers: +0.14 AUC on SciQ, +0.14 on CSQA, and +0.10 on FEVER. The result suggests the confidence signal is universal but obscured by high-dimensional noise in full embeddings.

F. Activation Steering Details

Experimental setup. Steering requires a single fixed direction for causal analysis, unlike classification, which uses cross-validation. We use a holdout split: the first 200 questions for probe training (to obtain the steering direction), the remaining n = 617 for evaluation. This differs from our classification protocol (GroupKFold CV) because steering tests whether one specific direction causally affects outputs, not classification generalization.

Protocol. Steering coefficient α ∈ [−5, 5] with 20 values. The probe weight vector is L2-normalized and scaled to 5% of the mean activation norm. Generation uses greedy decoding (temperature = 0). Correctness is determined by semantic match to TruthfulQA ground-truth best answers. (One way to realize this intervention is sketched below.)

Statistical significance. Learned direction: +10.9pp total effect (p < 0.001, two-sample t-test comparing α = −5 vs. α = +5). Random direction: +1.8pp (p = 0.59, not significant). Orthogonal direction: +1.6pp (p = 0.64, not significant). The learned direction's effect is statistically significant while control directions show no significant effect.
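The following sketch implements the intervention as a PyTorch forward hook on the residual stream. The module path and layer index are illustrative (they vary by architecture); only the L2 normalization and 5% scaling follow the protocol above.

import torch

def make_steering_hook(direction, alpha, mean_act_norm):
    """Add alpha * (5% of mean activation norm) along the unit probe direction."""
    unit = direction / direction.norm()  # L2-normalize the probe weight vector
    delta = alpha * 0.05 * mean_act_norm * unit
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + delta  # broadcasts over (batch, seq, d_model)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a decoder layer (module path varies by model family):
# handle = model.model.layers[23].register_forward_hook(
#     make_steering_hook(probe_direction, alpha=3.0, mean_act_norm=act_norm))
# ... generate greedily (temperature 0), then: handle.remove()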
G. Nested Cross-Validation

To verify that hyperparameter selection (layer, PLS dimension) does not inflate reported test AUC, we perform nested cross-validation with proper separation between selection and evaluation.

Protocol. Outer loop: 5-fold GroupKFold for final evaluation. Inner loop: 3-fold StratifiedKFold within each training set for hyperparameter selection. PLS dimensions searched: 1, 2, 3, 4, 5, 6, 7, 8, 12, 16. The inner loop selects the optimal dimension; the outer loop evaluates on truly held-out data.

Results. Table 11 shows negligible bias between nested and standard CV. Qwen2-7B: +0.005 (within noise). GPT-2-Large: −0.026 (standard CV is conservative). The inner CV consistently selects 5–8D, matching our fixed choice. Conclusion: reported AUCs are unbiased estimates; hyperparameter selection does not inflate performance.

Table 11. Nested CV shows no optimistic bias. Comparing nested CV (unbiased) vs. standard CV (fixed dim = 8). Bias = Standard − Nested. Negative bias indicates standard CV is actually conservative.

Model         Nested CV       Standard CV     Bias
Qwen2-7B      0.905 ± 0.040   0.910 ± 0.039   +0.005
GPT-2-Large   0.788 ± 0.048   0.761 ± 0.053   −0.026

H. Paraphrase Control Experiment

To verify that the confidence manifold encodes correctness rather than answer style, we test whether geometric separation persists across paraphrased answers.

Protocol. For each answer (correct or incorrect), we generate 5 paraphrase variants using templates: (1) original, (2) "The answer is: [answer]", (3) "To be precise, [answer]", (4) "In other words, [answer]", (5) "Simply put, [answer]". We compute PLS embeddings for all variants (Qwen2-7B, 817 TruthfulQA questions × 5 paraphrases × 2 correctness labels = 8,170 embeddings) and analyze variance in the discriminative subspace at layer 20 (optimal for Qwen2-7B).

Variance decomposition (a sketch of this computation closes the appendix):

• Within-answer variance (same correctness, different paraphrase): 242.80
• Between-answer variance (different correctness): 4,223.64
• F-ratio: 17.40

The 17× ratio confirms that correctness dominates the geometric separation. Paraphrase style contributes only 5.4% of the variance in the discriminative subspace (242.80 / (242.80 + 4,223.64)). If probes were detecting stylistic artifacts (sentence structure, prefix patterns), within-answer variance would be comparable to between-answer variance. The high F-ratio rules out this confound.

Generalization with no question overlap. To ensure probes generalize beyond surface style, we use GroupKFold with question-ID grouping: training on original answers from train questions, testing on paraphrased answers from held-out questions (no overlap). Results:

• Train AUC (original answers): 0.998
• Test AUC (paraphrased answers, unseen questions): 0.926 ± 0.011
• Degradation: 7.2%

The 0.926 test AUC demonstrates that probes trained on original answers generalize robustly to paraphrased answers on unseen questions, confirming detection of correctness rather than style. The modest 7.2% degradation indicates some paraphrase-specific features exist but do not dominate the confidence signal.
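Finally, a sketch of one simple version of the variance decomposition above, grouping embeddings by correctness label; the paper's exact grouping (by answer) may differ, and degrees-of-freedom normalization is omitted.

import numpy as np

def scatter_ratio(Z, y):
    """Unnormalized ratio of between- to within-class scatter."""
    mu = Z.mean(axis=0)
    between = sum((y == c).sum() * np.sum((Z[y == c].mean(axis=0) - mu) ** 2)
                  for c in (0, 1))
    within = sum(np.sum((Z[y == c] - Z[y == c].mean(axis=0)) ** 2)
                 for c in (0, 1))
    return between / within  # large => correctness, not style, drives separation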