
Paper deep dive

Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Eric Yocam, Varghese Vaidyan, Gurcan Comert, Paris Kalathas, Yong Wang, Judith L. Mwakalonge

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 64

Abstract

Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation -- requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy at all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. At the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving a probe-space selectivity of 5.94x, 3.5x higher than the ITI baseline -- demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.

Tags

ai-safety (imported, 100%) · cscl (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally; view the paper on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/13/2026, 1:08:08 AM

Summary

The paper introduces Adaptive Activation Cancellation (AAC), an inference-time framework for mitigating hallucinations in Large Language Models. AAC treats hallucination-associated neural activations as structured interference within the transformer residual stream, analogous to adaptive noise cancellation in signal processing. By identifying 'Hallucination Nodes' (H-Nodes) via layer-wise linear probing and applying a confidence-weighted forward hook, the method suppresses hallucination without requiring external knowledge, fine-tuning, or additional inference passes, while preserving model capability and reasoning accuracy.

Entities (5)

Adaptive Activation Cancellation · framework · 100%
LLaMA-3-8B · large-language-model · 100%
TruthfulQA · dataset · 100%
H-Nodes · neural-component · 95%
Transformer Residual Stream · model-architecture · 95%

Relation Signals (3)

Adaptive Activation Cancellation evaluated on TruthfulQA

confidence 100% · Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval

H-Nodes identified by Linear Probing

confidence 95% · The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing

Adaptive Activation Cancellation mitigates Hallucination

confidence 95% · We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference

Cypher Suggestions (2)

Identify methods used to detect H-Nodes · confidence 95% · unvalidated

MATCH (n:NeuralComponent {name: 'H-Nodes'})-[:IDENTIFIED_BY]->(m:Method) RETURN m.name

Find all models evaluated using the AAC framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'Adaptive Activation Cancellation'})-[:EVALUATED_ON]->(d:Dataset)<-[:EVALUATED_ON]-(m:Model) RETURN m.name, d.name

Full Text

63,697 characters extracted from source content.


Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models. Eric Yocam1, Varghese Vaidyan1, Gurcan Comert2, Paris Kalathas3, Yong Wang4, and Judith L. Mwakalonge5.

Abstract. Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation—requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy at all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. At the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving a probe-space selectivity of 5.94×, 3.5× higher than the ITI baseline—demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.
Key words and phrases: hallucination mitigation, large language models, adaptive noise cancellation, mechanistic interpretability, inference-time intervention, H-Nodes, transformer activations, linear probing

1 The Beacom College of Computer and Cyber Sciences, Dakota State University, Madison, SD 57042, USA. 2 Department of Computational Data Science and Engineering, North Carolina A&T State University, Greensboro, NC 27411, USA. 3 Department of Computer Science and Software Engineering, California Polytechnic State University, San Luis Obispo, CA 93407, USA. 4 Department of Computer Science, University of Idaho, Moscow, ID 83844, USA. 5 Department of Civil and Mechanical Engineering Technology, South Carolina State University, Orangeburg, SC 29115, USA.

1. Introduction

Transformer-based LLMs [21, 5, 17] have achieved state-of-the-art performance across a broad range of natural language tasks, yet they remain susceptible to hallucination: generating confident, fluent, but factually incorrect output [15, 9, 3, 4]. In high-stakes domains such as medicine, law, and education, factual accuracy is non-negotiable, making reliable hallucination mitigation a practical necessity. Existing mitigation strategies fall into three broad families. Retrieval augmentation [10] grounds generation in retrieved documents at inference time but requires an external knowledge source. Post-hoc verification [15] uses a second model or knowledge base to score or filter generated text after the fact. Knowledge editing [6] modifies model parameters to update factual associations but requires retraining. All three operate outside the model's internal computation and therefore do not address the generative mechanism itself.
Mechanistic interpretability research has shown that language models encode factual information in structured internal representations: feed-forward layers act as key-value memories [8], factual associations localise in specific neurons [16, 6], and truth-related representations form emergent linear structure in activation space [14]. Building on this perspective, we treat hallucination as a structured interference signal within the transformer residual stream and propose suppressing it using techniques adapted from classical adaptive noise cancellation (ANC) [23, 24]. Table 1 summarises the eight principal contributions of this work. Together, they establish AAC as a complementary inference-time approach that operates directly on internal activations, requires no external knowledge and no fine-tuning, and produces no measurable capability degradation.

Table 1. Summary of principal contributions.
1. Formal analogy between the transformer residual stream and an ANC primary channel.
2. Algorithm for identifying H-Nodes via signed probe weights and percentile baselines.
3. Real-time forward hook that suppresses H-Node activations during generation.
4. Empirical analysis across five cancellation variants and three scales (163M–8B).
5. Ablation confirming adaptive confidence-weighting reduces grounded drift by 25.9–40.1%.
6. Quantitative comparison with ITI [11] and DoLA [7].
7. Capability preservation on WikiText-103 perplexity and MMLU at all three scales.
8. Mechanistic profiling revealing cross-model hallucination attractors in H-Nodes.

2. Signal Processing Analogy

This section establishes the formal correspondence between classical adaptive noise cancellation and our framework, motivating the design choices made throughout the Method section.
The analogy is not merely illustrative: it determines the specific form of the cancellation update, the role of the percentile baseline as a noise estimate, and the choice of a continuously-applied forward hook over a one-shot post-hoc correction. Adaptive noise cancellation is a classical technique in which a reference signal correlated with an interference component is used to construct and subtract an estimate of that interference from a corrupted primary signal [23, 24]. The canonical LMS-based ANC update rule adapts the filter weights w_t to minimise the residual error e_t:

(1) w_{t+1} = w_t + 2μ e_t x_t,   e_t = d_t − w_t⊤ x_t,

where x_t is the reference input, d_t is the primary (corrupted) signal, e_t is the cleaned error signal, and μ is the step size. We draw an explicit parallel to the transformer residual stream. Let h_ℓ ∈ ℝ^d denote the hidden state at layer ℓ for the answer token. We decompose it as h_ℓ = s_ℓ + n_ℓ, where s_ℓ represents grounded semantic content and n_ℓ represents hallucination-associated interference. Table 2 formalises the component-wise correspondence.

Table 2. Formal analogy between classical ANC and Adaptive Activation Cancellation.
ANC component | Signal | AAC counterpart
Primary channel | d_t = s_t + n_t | Hidden state h_ℓ = s_ℓ + n_ℓ
Noise reference | x_t | H-Node activations h_ℓ[H]
Adaptive filter | w_t | Probe weights w_probe
Error signal | e_t | Residual after cancellation h′_ℓ
Cancellation coefficient | μ | α = 0.9 (attenuation scale)
Noise estimate | w_t⊤ x_t | Excess above percentile baseline

The key structural difference from classical ANC is that the noise reference is derived from the primary signal itself rather than an independent sensor, making AAC analogous to single-channel noise suppression. The complete three-phase implementation—offline probe training, H-Node identification, and real-time generation hook—is described formally in Section 3.
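The LMS recursion of Eq. (1) is compact enough to demonstrate directly. The following NumPy toy (a scalar filter on synthetic signals of our own invention, not the paper's code) cancels an interference component correlated with a reference channel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic signals: s is the clean signal, x is a noise reference,
# and the primary channel d = s + n, where n is correlated with x.
T = 5000
s = np.sin(2 * np.pi * 0.01 * np.arange(T))   # clean signal
x = rng.normal(size=T)                        # noise reference
n = 0.8 * x                                   # interference (gain 0.8 on x)
d = s + n                                     # corrupted primary channel

# LMS update from Eq. (1): w_{t+1} = w_t + 2*mu*e_t*x_t, e_t = d_t - w_t*x_t
mu, w = 0.01, 0.0
e = np.empty(T)
for t in range(T):
    e[t] = d[t] - w * x[t]      # cleaned error signal
    w += 2 * mu * e[t] * x[t]   # adapt the filter weight toward the true gain

# After convergence w is near 0.8, so e approximates s on the tail.
residual = np.mean((e[-1000:] - s[-1000:]) ** 2)
```

Here the single weight w plays the role of the adaptive filter in Table 2; in AAC the analogous quantities are the probe weights and the excess above the percentile baseline.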
3. Method

The AAC pipeline operates in three sequential phases: offline probe training to identify the best hallucination-discriminating layer, H-Node identification and baseline construction at that layer, and real-time generation with a forward hook.

3.1. Activation Extraction and Pooling

For each prompt in the dataset, hidden states {h_ℓ}_{ℓ=0}^{L} are extracted at all transformer layers. Extraction uses the flag output_hidden_states=True, which returns the full residual stream at every depth. Last-token pooling takes the representation at the final non-padding position, h_ℓ = H_ℓ[−1, :]; since autoregressive models condition the next token on all previous context, the last-token position aggregates the most predictive signal for hallucination [18]. Mean pooling averages over all non-padding positions: h_ℓ = (1/T) Σ_{t=1}^{T} H_ℓ[t, :]. As reported in Section 5, last-token pooling consistently outperforms mean pooling at every layer, motivating its exclusive use in the cancellation pipeline. The advantage narrows at larger scale (from +0.247 AUC for Phi-3-mini to +0.036 for LLaMA 3-8B), as mean-pool representations strengthen across the full sequence.

3.2. Layer-wise Hallucination Separability

At each layer ℓ, an L2-regularised logistic regression probe f_ℓ : ℝ^d → [0, 1] is trained on a balanced 50% split of the activation dataset following the linear probing methodology of Alain and Bengio [1] and Tenney et al. [19]:

(2) min_w Σ_i L_BCE(f(h_ℓ^(i)), y^(i)) + λ‖w‖₂².

Separability is quantified by ROC-AUC on a held-out 25% evaluation split, Cohen's d between hallucinated and grounded activation-norm distributions, and the ℓ2 distance between class centroids. The best layer ℓ* = argmax_ℓ AUC_ℓ is selected as the H-Node extraction and cancellation point.
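The layer-sweep of Section 3.2 can be sketched end to end. This NumPy version uses a hand-rolled gradient-descent logistic probe and synthetic "activations" with a signal injected at one layer (the paper uses scikit-learn's LogisticRegression on real hidden states; the layer count, dimensions, and injected signal here are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)

def train_probe(X, y, l2=1e-2, lr=0.1, steps=500):
    """L2-regularised logistic probe via plain gradient descent
    (a stand-in for scikit-learn's LogisticRegression)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
    return w

def roc_auc(scores, y):
    # Probability that a random positive scores above a random negative.
    pos, neg = scores[y == 1], scores[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

# Synthetic activations at 3 "layers"; only layer 1 carries a
# linearly separable hallucination direction (feature 0).
n, d = 400, 16
y = rng.integers(0, 2, n)
layers = [rng.normal(size=(n, d)) for _ in range(3)]
layers[1][:, 0] += 2.0 * y

aucs = []
for X in layers:
    X_tr, y_tr = X[:200], y[:200]   # probe-training split
    X_ev, y_ev = X[200:], y[200:]   # held-out evaluation split
    w = train_probe(X_tr, y_tr)
    aucs.append(roc_auc(X_ev @ w, y_ev))

best_layer = int(np.argmax(aucs))   # analogue of l* = argmax_l AUC_l
```

Selecting `best_layer` by held-out AUC mirrors the paper's choice of ℓ* as the single extraction and cancellation point.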
3.3. H-Node Identification

H-Nodes are the top-K neurons with the largest signed probe-weight magnitude in the direction of hallucination. Signed weights preserve directional structure: neurons with strong positive weight increase hallucination confidence, while those with strong negative weight suppress it. Given probe weight vector w ∈ ℝ^d:

(3) H = top-K(w),   K = 50.

A percentile baseline b ∈ ℝ^K encodes the typical grounded activation level, computed as the 80th percentile of H-Node activations over the 25% cancellation split:

(4) b_j = pct_80({h_j^(i) : y^(i) = 0}),   j ∈ H.

Only activations exceeding b_j are treated as excess hallucination signal eligible for suppression.

3.4. Cancellation Strategies

Six cancellation variants are implemented; five operate post-hoc on a single forward pass over the held-out evaluation split and one registers a live forward hook during autoregressive generation. Table 3 describes each variant. The general cancellation update for post-hoc methods is:

(5) h′[H] = h[H] − α · max(h[H] − b, 0),

and the Amplify variant additionally boosts anti-hallucination neurons (H̄, those with strongly negative probe weights):

(6) h′[H̄] = h[H̄] + α · max(b − h[H̄], 0).

Table 3. Cancellation strategy descriptions and operating mode.
Method | Mode | Description
Mean baseline | Post-hoc | Subtract the mean H-Node activation over grounded samples.
Pct80 H-Node | Post-hoc | Suppress excess above the 80th percentile baseline (Eq. 5).
Pct80 Amplify | Post-hoc | Suppress pro-hallucination H-Nodes; boost anti-hallucination neurons (Eq. 6).
Pct80 Fourier | Post-hoc | Apply FFT to the excess signal, retain the top-5 spectral components, and subtract; requires excess > 0.01 to avoid injecting noise.
Pct80 Zero | Post-hoc | Clamp H-Node activations to the percentile baseline.
Real-time Hook | Generation | Register a forward hook on layer ℓ*; intercept the hidden state, compute the excess, and apply confidence-weighted attenuation at every generation step.

The Fourier variant requires brief theoretical motivation, since the FFT is not typically applied to non-sequential data. We treat the hidden dimension as a spatial signal across the embedding manifold: the K-dimensional excess vector e = max(h[H] − b, 0) is a one-dimensional signal indexed by neuron position. Hallucination-associated interference tends to produce structured, low-frequency patterns across this spatial signal—neurons that co-activate during hallucination are not randomly distributed but cluster in correlated groups. The FFT decomposes this signal into its spectral components, retains only the top-5 (dominant frequency modes), and subtracts the reconstructed interference. The empirical selectivity results (4.20× for OPT-125M, 5.39× for LLaMA 3-8B) confirm that this spectral decomposition captures meaningful structure in the excess signal. The real-time hook applies an additional confidence-weighted scale factor. Let c = f_{ℓ*}(h) denote the probe confidence for the current hidden state. The adaptive attenuation becomes:

(7) h′[H] = h[H] − c · α · max(h[H] − b, 0),   c = f_{ℓ*}(h),   α = 0.9.

This modulates suppression strength proportionally to how confidently the probe classifies the current hidden state as hallucinated, reducing unnecessary attenuation on ambiguous or grounded samples. Algorithm 1 presents the offline probe training phase; Algorithm 2 presents the real-time forward hook.
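The offline identification (Eqs. 3–4) and the online confidence-gated attenuation (Eq. 7) can be sketched together in NumPy. All weights and activations below are synthetic stand-ins, not the paper's data; in a PyTorch pipeline the `hook` function would be registered on layer ℓ* via `module.register_forward_hook`:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, alpha, theta, pct = 64, 8, 0.9, 0.45, 80

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- Offline phase (Eqs. 3-4), synthetic stand-ins ----------------------
w_probe = rng.normal(size=d)            # trained probe weights (stand-in)
bias = -1.0
H = np.argsort(-w_probe)[:K]            # Eq. 3: top-K pro-hallucination neurons
grounded = rng.normal(size=(500, d))    # grounded activations (stand-in)
b = np.percentile(grounded[:, H], pct, axis=0)  # Eq. 4: percentile baseline

# --- Real-time hook (Eq. 7 / Algorithm 2) -------------------------------
def hook(h):
    c = sigmoid(h @ w_probe + bias)     # probe confidence P(hallucinated)
    if c <= theta:
        return h                        # below threshold: pass through
    out = h.copy()
    excess = np.maximum(h[H] - b, 0.0)  # only activation above baseline is noise
    out[H] = h[H] - c * alpha * excess  # confidence-weighted attenuation
    return out

# A "hallucinating" hidden state: H-Node activations pushed well above b.
h = rng.normal(size=d)
h[H] += 3.0
h_prime = hook(h)
```

Only the excess above the grounded baseline is subtracted, so non-H-Node dimensions and below-baseline H-Node activations pass through untouched, which is what makes the intervention surgical.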
Input: Dataset D = {(prompt_i, y_i)}, model M, K = 50, percentile p = 80
Output: Best layer ℓ*, H-Node set H, baseline b, probe f_{ℓ*}
1. Split D into D_train (50%), D_cancel (25%), D_eval (25%).
2. For each layer ℓ = 0, …, L:
3.   Extract last-token hidden states {h_ℓ^(i)} from M for all i ∈ D_train.
4.   Train an L2-logistic probe f_ℓ on {(h_ℓ^(i), y_i)} (Eq. 2).
5.   Compute AUC_ℓ on D_eval.
6. ℓ* ← argmax_ℓ AUC_ℓ.
7. w ← probe weights of f_{ℓ*}; H ← top-K(w)  // signed weight ranking, Eq. 3.
8. Extract {h_{ℓ*}^(i)[H]} for grounded samples (y^(i) = 0) in D_cancel.
9. b_j ← pct_p({h_j^(i) : y^(i) = 0}) for each j ∈ H  // Eq. 4.
10. Return ℓ*, H, b, f_{ℓ*}.
Algorithm 1. AAC offline probe training and H-Node identification.

Input: Hidden state h at layer ℓ*, probe f_{ℓ*}, H-Nodes H, baseline b, α = 0.9, threshold θ = 0.45
Output: Modified hidden state h′
1. c ← f_{ℓ*}(h)  // probe confidence: P(hallucinated).
2. If c > θ:
3.   e ← max(h[H] − b, 0)  // excess above grounded baseline.
4.   h′[H] ← h[H] − c · α · e  // adaptive attenuation, Eq. 7.
5.   h′[H̄] ← h[H̄]  // non-H-Node dimensions unchanged.
6. Else: h′ ← h  // below threshold: pass through.
7. Return h′.
Algorithm 2. AAC real-time forward hook (runs at every autoregressive generation step).

3.5. Evaluation Metrics

Four metrics quantify activation-space cancellation quality, and three metrics assess generation-level effect. The selectivity ratio is the primary activation-space diagnostic:

(8) Sel = Reduc / Drift = Δŷ_hall / Δŷ_grnd,

where Δŷ_hall is the decrease in probe confidence on hallucinated samples and Δŷ_grnd is the corresponding change on grounded samples. Values of Sel > 1 indicate that hallucination suppression exceeds collateral grounded degradation. All seven metrics are listed with their definitions and desired directions in Table 4.
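Eq. (8) reduces to a few lines; the probe confidences below are made-up numbers used purely to exercise the metric:

```python
import numpy as np

def selectivity(conf_before, conf_after, y):
    """Eq. 8: Sel = (drop in probe confidence on hallucinated samples)
    divided by (change on grounded samples)."""
    delta = conf_before - conf_after
    reduc = delta[y == 1].mean()   # want large
    drift = delta[y == 0].mean()   # want near zero
    return reduc / drift, reduc, drift

# Hypothetical probe confidences before/after cancellation.
y = np.array([1, 1, 1, 0, 0, 0])
before = np.array([0.90, 0.80, 0.85, 0.30, 0.20, 0.25])
after  = np.array([0.60, 0.55, 0.60, 0.28, 0.19, 0.23])
sel, reduc, drift = selectivity(before, after, y)  # sel comes out to 16.0
```

A value of Sel well above 1, as here, means hallucination confidence drops far more than grounded confidence drifts, which is the regime the paper calls surgical.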
Table 4. Evaluation metrics, definitions, and desired direction.
Hallucination confidence reduction (Reduc): decrease in probe confidence on hallucinated samples; higher is better.
Grounded drift (Drift): change in probe confidence on grounded samples; lower is better.
Selectivity (Sel): Reduc / Drift (Eq. 8); higher is better (> 1).
Separation change (SepΔ): change in the hallucinated − grounded confidence gap; higher is better.
MC1 accuracy (MC1): fraction of TruthfulQA correct via log-prob ranking; higher is better.
MC2 truthfulness (MC2): normalised probability mass over true answers [12]; higher is better.
Token-F1 (F1): token-level overlap with reference answers; higher is better.

4. Experimental Setup

This section describes the models, datasets, and implementation choices used throughout all experiments. Model selection was guided by the goal of spanning a wide parameter-scale range (49× from smallest to largest) while using publicly available open-source weights. Dataset selection favoured benchmarks with explicit per-sample truthfulness labels that drive both probe training and generation evaluation.

4.1. Models

The three models and their key architectural properties are summarised in Table 5. Each represents a distinct scale regime, enabling analysis of how hallucination representations and cancellation efficacy change with model capacity.

Table 5. Model specifications. All models are used with frozen weights; no fine-tuning is performed.
Model | Params | Layers | Hidden dim | Dtype | Loading
OPT-125M [27] | 163.8M | 12 | 768 | float32 | Standard
Phi-3-mini | 3,821M | 32 | 3072 | float16 | device-map
LLaMA 3-8B [20] | 8,030M | 32 | 4096 | bfloat16 | device-map

4.2. Datasets

Two publicly available benchmarks provide the labelled samples required for probe training and evaluation. Table 6 summarises their properties and roles in the pipeline.

Table 6. Datasets used for activation extraction, probe training, and generation evaluation.
Dataset | Samples | Label type | Role in pipeline
TruthfulQA [12] | 600 (validation) | MC correct / incorrect | Activation extraction, probe training, cancellation baseline, generation evaluation
HaluEval | 600 | Explicit hallucination field | Cross-benchmark generalisation evaluation

4.3. Implementation Details

All probes use scikit-learn's L2-regularised LogisticRegression, with the dataset split 50/25/25 for probe training, cancellation-baseline construction, and held-out evaluation respectively. The H-Node count is K = 50, the attenuation scale is α = 0.9, the confidence activation threshold is θ = 0.45, and the baseline is set at the 80th percentile. Generation evaluation uses n = 100 samples with max_new_tokens=30. All experiments use frozen model weights; the only inference-time modification is the optional forward hook on layer ℓ*.

5. Results

The results are organised in eleven subsections, proceeding from basic characterisation through the core cancellation comparison to the new ablation, baseline comparisons, capability preservation, and mechanistic profiling experiments. The overarching finding is that hallucination-associated activations are linearly separable at all three scales, separability peaks near 50% network depth in all models, and the real-time forward hook is the only intervention that consistently improves downstream accuracy.

5.1. Pooling Strategy Comparison

Last-token pooling outperforms mean pooling at every layer for all three models, as shown in Table 7. The hallucination signal concentrates in the final answer token, whereas mean pooling dilutes it across the sequence [5, 18]. Notably, the last-token advantage shrinks substantially at larger scale: LLaMA 3-8B's mean-pool representation already achieves 0.862 AUC, reflecting that at sufficient capacity the hallucination signal spreads into the full-sequence representation rather than remaining concentrated at the last position.

Table 7. Best-layer AUC: last-token vs.
mean-pool pooling across all three models.
Model | Best Layer | Last-Token AUC | Mean-Pool AUC | Gain
OPT-125M | 6 | 0.7535 | 0.6270 | +0.1264
Phi-3-mini | 17 | 0.8877 | 0.6402 | +0.2474
LLaMA 3-8B | 15 | 0.8979 | 0.8617 | +0.0362

The last-token advantage opens at layer 1, and all models peak near 50% network depth before declining toward the output layers, consistent with the layer-sweep data in Tables 8–10.

5.2. Full Layer Sweep

Tables 8, 9, and 10 provide the complete per-layer AUC for all three models. OPT-125M peaks at layer 6 (AUC 0.754, 50% depth) then declines sharply. Phi-3-mini peaks at layer 17 (AUC 0.888, 53% depth) and remains above 0.82 through the final layers, indicating persistent hallucination geometry throughout the deeper network. LLaMA 3-8B peaks at layer 15 (AUC 0.898, 46% depth) and sustains high separability throughout, with mean-pool gradually approaching last-token by the final layers, consistent with the spreading of the hallucination signal into full-sequence representations discussed in Section 6.2.

Table 8. OPT-125M per-layer AUC. Best layer highlighted.
Layer | Last-Token AUC | Mean-Pool AUC | Gain | Depth
0 | 0.4337 | 0.4891 | −0.0554 | 0%
1 | 0.5426 | 0.4396 | +0.1030 | 8%
2 | 0.6105 | 0.4206 | +0.1899 | 17%
3 | 0.6594 | 0.4340 | +0.2254 | 25%
4 | 0.7251 | 0.4568 | +0.2683 | 33%
5 | 0.7237 | 0.5264 | +0.1973 | 42%
6 | 0.7535 | 0.5709 | +0.1826 | 50%
7 | 0.7485 | 0.6023 | +0.1462 | 58%
8 | 0.7489 | 0.6144 | +0.1344 | 67%
9 | 0.7314 | 0.6270 | +0.1043 | 75%
10 | 0.6968 | 0.6144 | +0.0823 | 83%
11 | 0.6491 | 0.5901 | +0.0590 | 92%
12 | 0.5848 | 0.4849 | +0.0999 | 100%

Table 9. Phi-3-mini per-layer AUC (selected layers). Best layer highlighted.
Layer | Last-Token AUC | Mean-Pool AUC | Gain | Depth
0 | 0.4944 | 0.4685 | +0.0259 | 0%
3 | 0.6457 | 0.4181 | +0.2275 | 9%
4 | 0.7019 | 0.4070 | +0.2948 | 13%
7 | 0.7198 | 0.4530 | +0.2668 | 22%
9 | 0.7759 | 0.5137 | +0.2622 | 28%
11 | 0.7925 | 0.5574 | +0.2351 | 34%
13 | 0.8184 | 0.5553 | +0.2631 | 41%
15 | 0.8805 | 0.5905 | +0.2900 | 47%
16 | 0.8852 | 0.5983 | +0.2869 | 50%
17 | 0.8877 | 0.6402 | +0.2474 | 53%
18 | 0.8767 | 0.6212 | +0.2554 | 56%
20 | 0.8663 | 0.6016 | +0.2647 | 63%
23 | 0.8712 | 0.5633 | +0.3079 | 72%
26 | 0.8526 | 0.5080 | +0.3446 | 81%
29 | 0.8353 | 0.4757 | +0.3596 | 91%
32 | 0.8304 | 0.4481 | +0.3822 | 100%

Table 10. LLaMA 3-8B per-layer AUC (selected layers). Best layer highlighted.
Layer | Last-Token AUC | Mean-Pool AUC | Gain | Depth
0 | 0.4944 | 0.5000 | −0.0056 | 0%
2 | 0.6552 | 0.5098 | +0.1454 | 6%
4 | 0.7694 | 0.5398 | +0.2296 | 13%
6 | 0.8111 | 0.6640 | +0.1472 | 19%
8 | 0.8456 | 0.7349 | +0.1106 | 25%
10 | 0.8586 | 0.7999 | +0.0588 | 31%
12 | 0.8625 | 0.8193 | +0.0432 | 38%
13 | 0.8904 | 0.8578 | +0.0326 | 41%
14 | 0.8946 | 0.8617 | +0.0328 | 44%
15 | 0.8979 | 0.8528 | +0.0451 | 47%
16 | 0.8885 | 0.8468 | +0.0417 | 50%
18 | 0.8685 | 0.8298 | +0.0388 | 56%
21 | 0.8743 | 0.8257 | +0.0486 | 66%
25 | 0.8479 | 0.7988 | +0.0491 | 78%
29 | 0.8325 | 0.7769 | +0.0556 | 91%
32 | 0.8353 | 0.7443 | +0.0910 | 100%

5.3. Activation Trajectory Summary

Table 11 compares trajectory statistics across all three models. Phi-3-mini shows a 25× larger centroid distance and 3× larger Cohen's d than OPT-125M, reflecting a more geometrically structured hallucination representation. LLaMA 3-8B extends this trend: Cohen's d reaches 0.577 (4.4× OPT-125M) while centroid distance partially consolidates to 22.5, as the wider hidden dimension (4096) concentrates the signal more compactly per neuron. Crucially, the fraction of hallucination signal above the 80th-percentile baseline grows monotonically—11.1 points for OPT-125M, 13.3 for Phi-3-mini, and 16.4 for LLaMA 3-8B—confirming that larger models produce a stronger and more committed hallucination signal.

Table 11. Activation trajectory statistics across all three models.
Metric | OPT-125M | Phi-3-mini | LLaMA 3-8B
Total layers | 12 | 32 | 32
AUC peak layer | 6 | 17 | 15
AUC peak value | 0.7535 | 0.8877 | 0.8979
AUC peak depth (%) | 50 | 53 | 46
Separation peak layer | 9 | 16 | 15
Separation peak value | 0.2539 | 0.5537 | 0.5280
Max Cohen's d | 0.1303 | 0.4138 | 0.5768
Max centroid distance | 2.68 | 67.06 | 22.50
Max AUC gain (last − mean) | 0.2683 | 0.3823 | 0.2296
Probe AUC on cancel split | 0.8234 | 0.9321 | 0.9430
Baseline separation above pct80 | 11.1 | 13.3 | 16.4

5.4. Cancellation Method Comparison

Table 12 shows cancellation results for OPT-125M. All five post-hoc methods achieve positive selectivity (Sel > 1), confirming that hallucination activations are attenuated more than grounded ones. The Fourier method achieves the highest selectivity at 4.20×, indicating that retaining only dominant spectral components of the excess signal provides the most targeted suppression. Despite these encouraging probe-space results, no post-hoc method improves downstream accuracy—only the real-time hook does, a finding explained in Section 8.

Table 12. OPT-125M cancellation results. ⋆ denotes Sel > 1.
Method | Reduc | Drift | Supp% | Sel | SepΔ
Post-hoc Mean | 0.0731 | 0.0222 | 12.7% | 3.29⋆ | −0.051
Post-hoc Pct80 H-Node | 0.0259 | 0.0074 | 4.5% | 3.48⋆ | −0.019
Post-hoc Pct80 Amplify | 0.0393 | 0.0114 | 6.8% | 3.44⋆ | −0.028
Post-hoc Pct80 Fourier | 0.0133 | 0.0032 | 2.3% | 4.20⋆ | −0.010
Post-hoc Pct80 Zero | 0.0467 | 0.0138 | 8.1% | 3.37⋆ | −0.033
Real-time Hook (ANC) | −0.0073 | 0.0281 | −1.3% | −0.26 | +0.035

Table 13 shows results for Phi-3-mini. Selectivity is uniformly lower (best: 1.72×) despite the stronger hallucination signal, because H-Nodes at this scale are more entangled with grounded features in the 3072-dimensional hidden space. Table 14 shows LLaMA 3-8B, where post-hoc selectivity recovers strongly (5.58× H-Node), consistent with the wider hidden dimension allowing cleaner H-Node isolation.
Figure 1 visualises the selectivity pattern across all three models.

Table 13. Phi-3-mini cancellation results. ⋆ denotes Sel > 1.
Method | Reduc | Drift | Supp% | Sel | SepΔ
Post-hoc Mean | 0.0216 | 0.0143 | 2.9% | 1.51⋆ | −0.007
Post-hoc Pct80 H-Node | 0.0069 | 0.0045 | 0.9% | 1.51⋆ | −0.002
Post-hoc Pct80 Amplify | 0.0100 | 0.0058 | 1.3% | 1.72⋆ | −0.004
Post-hoc Pct80 Fourier | 0.0041 | 0.0028 | 0.6% | 1.48⋆ | −0.001
Post-hoc Pct80 Zero | 0.0111 | 0.0079 | 1.5% | 1.40⋆ | −0.003
Real-time Hook (ANC) | 0.0338 | 0.0345 | 4.5% | 0.98 | +0.001

Table 14. LLaMA 3-8B cancellation results. ⋆ denotes Sel > 1.
Method | Reduc | Drift | Supp% | Sel | SepΔ
Post-hoc Mean | 0.0190 | 0.0038 | 2.6% | 5.06⋆ | −0.015
Post-hoc Pct80 H-Node | 0.0067 | 0.0012 | 0.9% | 5.58⋆ | −0.006
Post-hoc Pct80 Amplify | 0.0106 | 0.0020 | 1.4% | 5.42⋆ | −0.009
Post-hoc Pct80 Fourier | 0.0047 | 0.0009 | 0.6% | 5.39⋆ | −0.004
Post-hoc Pct80 Zero | 0.0105 | 0.0019 | 1.4% | 5.54⋆ | −0.009
Real-time Hook (ANC) | 0.0502 | 0.0371 | 6.8% | 1.35⋆ | −0.013

Figure 1. Cancellation selectivity by method and model. Values above the dashed line (Sel = 1) indicate net benefit. Post-hoc selectivity is non-monotonic across scale: Phi-3-mini is lowest (1.72×) while LLaMA 3-8B recovers to 5.58×.

5.5. Percentile Baseline Sweep

The percentile threshold p in Eq. 4 governs the precision-recall trade-off for hallucination suppression. Table 15 sweeps this threshold for OPT-125M. Raising the threshold increases selectivity monotonically, reaching 8.57× at the 99th percentile, because drift drops faster than reduction: the most extreme H-Node activations are almost exclusively associated with hallucinated samples. Figure 2 shows this gain becomes super-linear above the 90th percentile. LLaMA 3-8B exhibits a different profile: selectivity peaks at the 85th percentile (5.64×) and plateaus or declines above the 90th, indicating that in larger models the most committed hallucination signal is concentrated at moderate thresholds rather than at the extreme tail.

Table 15. OPT-125M percentile sweep (H-Node cancellation).
Percentile | Reduc | Drift | Sel | Sep
50th | 0.0736 | 0.0224 | 3.28× | 0.1977
60th | 0.0565 | 0.0170 | 3.32× | 0.2094
70th | 0.0401 | 0.0119 | 3.37× | 0.2206
75th | 0.0329 | 0.0096 | 3.42× | 0.2256
80th | 0.0259 | 0.0074 | 3.48× | 0.2304
85th | 0.0205 | 0.0056 | 3.66× | 0.2340
90th | 0.0148 | 0.0037 | 3.98× | 0.2378
95th | 0.0083 | 0.0017 | 4.78× | 0.2423
99th | 0.0026 | 0.0003 | 8.57× | 0.2466

Figure 2. Selectivity, reduction, and drift vs. percentile threshold for OPT-125M. Selectivity rises super-linearly above the 90th percentile as drift approaches zero.

5.6. Downstream Accuracy

Table 16 reports held-out classification accuracy, hallucination rate, and ROC-AUC across all three models and all methods. A clean dissociation emerges: every post-hoc method leaves accuracy flat across all scales, while the real-time hook is the only method that consistently improves accuracy—+0.020 for OPT-125M, +0.007 for Phi-3-mini, and +0.007 for LLaMA 3-8B. The ROC-AUC trends diverge between models: for OPT-125M the hook increases ROC-AUC from 0.786 to 0.809, while for LLaMA 3-8B it decreases slightly (0.916 → 0.906). This divergence is interpreted in Section 8 as a signature of mechanistic delocalization at larger scale.

Table 16. Downstream accuracy on the held-out TruthfulQA evaluation split.
Model | Method | Accuracy | Hall Rate | ROC-AUC
OPT-125M | Baseline | 0.6933 | 0.7125 | 0.7859
OPT-125M | Post-hoc Mean | 0.6933 | 0.7000 | 0.7836
OPT-125M | Post-hoc Pct80 | 0.6933 | 0.7000 | 0.7854
OPT-125M | Post-hoc Amplify | 0.6933 | 0.7000 | 0.7838
OPT-125M | Post-hoc Fourier | 0.6933 | 0.7125 | 0.7864
OPT-125M | Post-hoc Zero | 0.6933 | 0.7000 | 0.7852
OPT-125M | Real-time Hook | 0.7133 | 0.7875 | 0.8091
Phi-3-mini | Baseline | 0.7933 | 0.7375 | 0.9086
Phi-3-mini | Post-hoc Mean | 0.7933 | 0.7375 | 0.9080
Phi-3-mini | Post-hoc Pct80 | 0.7933 | 0.7375 | 0.9086
Phi-3-mini | Post-hoc Amplify | 0.7933 | 0.7375 | 0.9086
Phi-3-mini | Post-hoc Fourier | 0.7933 | 0.7375 | 0.9086
Phi-3-mini | Post-hoc Zero | 0.7933 | 0.7375 | 0.9086
Phi-3-mini | Real-time Hook | 0.8000 | 0.7375 | 0.9027
LLaMA 3-8B | Baseline | 0.8133 | 0.8125 | 0.9159
LLaMA 3-8B | Post-hoc Mean | 0.8133 | 0.8125 | 0.9157
LLaMA 3-8B | Post-hoc Pct80 | 0.8133 | 0.8125 | 0.9159
LLaMA 3-8B | Post-hoc Amplify | 0.8133 | 0.8125 | 0.9155
LLaMA 3-8B | Post-hoc Fourier | 0.8133 | 0.8125 | 0.9159
LLaMA 3-8B | Post-hoc Zero | 0.8133 | 0.8125 | 0.9157
LLaMA 3-8B | Real-time Hook | 0.8200 | 0.8625 | 0.9055

5.7. Generation Evaluation

Table 17 reports MC1, MC2 truthfulness, and token-F1 for free generation (n = 100). OPT-125M and Phi-3-mini baseline MC1 accuracy (0.24 and 0.29) falls at or below the near-chance threshold, so generation deltas for those models cannot be reliably interpreted [12, 26]; results are reported for completeness. LLaMA 3-8B, with baseline MC1 of 0.29, sits at the threshold boundary; however, its hook intervention produces positive improvements across every generation metric: MC1 +0.04, MC2 truthfulness +0.003, and Token-F1 +0.003. This constitutes the first consistent positive generation-level signal in the study.

Table 17. Generation evaluation (n = 100). LLaMA 3-8B shows consistent positive improvement under the hook intervention.
Model | Condition | MC1 | MC2 | Token-F1 | EM
OPT-125M | Baseline | 0.24 | 0.4179 | 0.1781 | 0.00
OPT-125M | Hook ANC | 0.23 | 0.4126 | 0.1728 | 0.00
OPT-125M | Delta | N/A (near-chance baseline)
Phi-3-mini | Baseline | 0.29 | 0.4260 | 0.2418 | 0.00
Phi-3-mini | Hook ANC | 0.28 | 0.4250 | 0.2370 | 0.00
Phi-3-mini | Delta | N/A (near-chance baseline)
LLaMA 3-8B | Baseline | 0.29 | 0.4306 | 0.2156 | 0.00
LLaMA 3-8B | Hook ANC | 0.33 | 0.4335 | 0.2183 | 0.00
LLaMA 3-8B | Delta | +0.04 | +0.003 | +0.003 | 0.00

5.8. Ablation: Adaptive vs.
Static Cancellation The confidence-weighted adaptive attenuation of Eq. 7 introduces a per-sample scale factor c that reduces suppression strength on samples the probe considers ambiguous. To isolate its contribution, Table 18 compares the adaptive hook against a static variant that applies uniform attenuation (c=1.0c=1.0). Adaptive weighting consistently reduces grounded drift by 25.9–40.1% while preserving comparable hallucination reduction, confirming that the confidence factor is load-bearing. The benefit is largest for OPT-125M (40.1%40.1\%), where per-sample confidence variance is highest, and smallest for LLaMA 3-8B (25.9%25.9\%), where the probe is highly confident on nearly all samples. Figure 3 visualises the drift reduction across all three scales. Table 18. Ablation: static (c=1.0c=1.0) vs. adaptive (c=c=conf) ANC. Adaptive reduces grounded drift at all three scales while preserving selectivity. Model Method Hall Grnd Reduc Drift Sel OPT-125M Baseline 0.5761 0.3272 — — — Static (c=1.0c=1.0) 0.5340 0.3148 0.0421 0.0124 3.39× Adaptive (c=c=conf) 0.5502 0.3198 0.0259 0.0074 3.48× Phi-3-mini Baseline 0.7393 0.0977 — — — Static (c=1.0c=1.0) 0.7203 0.0849 0.0190 0.0128 1.49× Adaptive (c=c=conf) 0.7245 0.0885 0.0148 0.0092 1.62× LLaMA 3-8B Baseline 0.7699 0.1489 — — — Static (c=1.0c=1.0) 0.7456 0.1448 0.0244 0.0042 5.83× Adaptive (c=c=conf) 0.7515 0.1458 0.0184 0.0031 5.94× Figure 3. Static vs. adaptive ANC: hallucination confidence, grounded confidence, and selectivity across all three model scales. Adaptive confidence weighting reduces grounded drift by 25.9–40.1%. 5.9. Comparison with ITI and DoLA To contextualise H-Node ANC within the inference-time intervention literature, this section compares it against two published baselines applied under identical conditions. 
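The attenuation rule contrasted in the Section 5.8 ablation can be sketched in a few lines of PyTorch; wrapped as a forward hook on the chosen decoder layer, the same rule becomes the real-time variant applied during generation. This is a minimal sketch under assumed conventions: the probe is taken to be a logistic head on the last-token hidden state, and tensor shapes, layer choice, and all function names are illustrative (the paper's Eq. 7 fixes the exact form).

```python
import torch

def attenuate(h_last, h_node_idx, probe_w, probe_b,
              alpha=0.9, theta=0.45, adaptive=True):
    """Confidence-weighted H-Node attenuation on the last-token hidden state.

    h_last: (batch, hidden) residual-stream state driving the next token.
    adaptive=False approximates the static ablation variant (c = 1.0);
    adaptive=True scales suppression by the probe's hallucination
    probability and skips samples gated below theta.
    """
    conf = torch.sigmoid(h_last @ probe_w + probe_b)   # probe confidence, (batch,)
    gate = (conf > theta).float()
    c = conf * gate if adaptive else gate              # per-sample scale factor
    out = h_last.clone()
    # attenuate only the K targeted dimensions, by alpha * c
    out[:, h_node_idx] *= (1.0 - alpha * c).unsqueeze(-1)
    return out

def make_hook(h_node_idx, probe_w, probe_b):
    """Wrap the rule as a forward hook for real-time use during generation."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output  # (batch, seq, hidden)
        h[:, -1, :] = attenuate(h[:, -1, :], h_node_idx, probe_w, probe_b)
        return output
    return hook

# Usage (hypothetical layer index for a LLaMA-style model):
# model.model.layers[15].register_forward_hook(make_hook(h_nodes, w, b))
```

Because the hook only rewrites the last position of the chosen layer's output, every other token position and every untargeted dimension pass through unchanged, which is the mechanism behind the "surgical" property reported in Section 5.10.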
Inference-Time Intervention (ITI) [11] computes a probing direction d = normalize(mean(h_hall) − mean(h_grnd)) and subtracts α (h · d) d from the hidden state, sweeping α ∈ {5, 10, 15, 20, 30} with the best selectivity reported. Decoding by Contrasting Layers (DoLA) [7] re-scores each MC answer as log P_late − 0.5 log P_early, using an early-exit layer at 38% depth. Results are summarised in Table 19 and Figure 4.

The Polysemanticity Scale-Trap. The most architecturally revealing result is Phi-3-mini's inversion: ITI's global direction (1.88×) outperforms H-Node ANC (1.62×) at this scale, the only model where our method does not lead on probe selectivity. We attribute this to what we term the polysemanticity scale-trap. In the 3B–4B parameter range, individual neurons are empirically more polysemantic, simultaneously encoding multiple task-relevant features, than in either sub-billion or multi-billion models. This phenomenon has been documented in mechanistic interpretability work [3, 4]: as model capacity grows, representations first become more entangled before wider hidden dimensions eventually allow feature re-segregation. In Phi-3-mini's 3072-dimensional hidden space, the H-Nodes selected by signed probe weights are particularly likely to be task-shared, so suppressing them via a sparse (K = 50) intervention disrupts grounded circuits at a disproportionate rate. A global direction, as ITI applies, averages across all d = 3072 dimensions and is therefore less sensitive to the polysemantic entanglement of any individual neuron subset. This architectural insight, that sparse H-Node cancellation is most effective at the scale extremes and most challenged at intermediate parameter counts, is itself a contribution: it identifies the polysemanticity regime as the primary design constraint for future sparse activation interventions.

Complementary Operating Points. At OPT-125M and LLaMA 3-8B, H-Node ANC substantially outperforms ITI in probe selectivity: +1.92× at OPT scale and +4.25× at LLaMA scale (ANC 5.94× vs. ITI 1.69×, a 3.5× advantage). For generation-level MC1 at LLaMA 3-8B, DoLA (+0.08) outperforms H-Node ANC (+0.04), because DoLA is optimised as a brute-force decoding-time accuracy booster: it re-ranks every answer by contrasting early- and late-layer distributions, maximising MC1 without regard for the internal specificity of the intervention. H-Node ANC occupies a different and complementary operating point: surgical diagnostic intervention. By suppressing only 50 neurons at a single layer, and only when probe confidence exceeds θ = 0.45, it achieves 5.94× probe-space selectivity, confirming that the identified H-Nodes are the specific locus of hallucination signal, not merely correlated with it. When the design requirement is to isolate which neurons drive a specific factual error, or to suppress a targeted hallucination type without perturbing the broader representation, H-Node ANC is the appropriate tool; DoLA, operating at the decoding level with no access to internal neuron identity, cannot provide that diagnostic resolution. Neither method dominates across all metrics and scales; together they span the precision–recall frontier of inference-time hallucination mitigation.

Table 19. H-Node ANC vs. ITI [11] and DoLA [7]. Methods occupy complementary operating points.

Model        Metric                   ITI     DoLA    H-Node ANC
OPT-125M     Best probe selectivity   1.56×   —       3.48×
             MC1 delta                N/A     +0.03   N/A
Phi-3-mini   Best probe selectivity   1.88×   —       1.62×
             MC1 delta                N/A     −0.07   N/A
LLaMA 3-8B   Best probe selectivity   1.69×   —       5.94×
             MC1 delta                —       +0.08   +0.04

Figure 4. Probe selectivity (left) and MC1 generation delta (right) for ITI, DoLA, and H-Node ANC across all three model scales. H-Node ANC leads in selectivity at OPT and LLaMA scale; DoLA leads in MC1 at LLaMA scale.

5.10.
Capability Preservation

Many inference-time interventions trade general language capability for factual improvement, either increasing perplexity on fluent text or degrading reasoning accuracy on unrelated tasks. The capability preservation results reported here constitute one of the strongest empirical properties of the AAC framework: the forward hook produces exactly 0.0% change in WikiText-103 perplexity and zero change in MMLU subset accuracy at all three model scales. This is not a soft near-zero result; the perplexity values are identical to four significant figures at every scale. Table 20 reports the full figures. The result directly counterbalances the observation that H-Node ANC achieves lower MC1 gains than DoLA at LLaMA 3-8B scale: unlike DoLA, which operates at the decoding level and may inadvertently re-weight fluent incorrect answers, AAC's neuron-level suppression leaves the model's language-modelling and reasoning distributions entirely intact. This is mechanistically expected: the hook attenuates only 50 of thousands of hidden dimensions, and only on tokens where the probe confidence exceeds θ = 0.45, leaving the vast majority of computation untouched. The implication for deployment is significant: AAC can be enabled without re-evaluating the model's general-purpose capability benchmarks.

Table 20. Capability preservation: WikiText-103 perplexity (80 sentences) and MMLU accuracy (100 questions). Zero degradation confirms surgical intervention.

Model        Metric                 Baseline   Hook ANC   Verdict
OPT-125M     PPL (WikiText-103 ↓)   65.42      65.42      ✓ Surgical (Δ = 0.0%)
             MMLU accuracy          0.20       0.21       ✓ Preserved
Phi-3-mini   PPL (WikiText-103 ↓)   11.66      11.66      ✓ Surgical (Δ = 0.0%)
             MMLU accuracy          0.40       0.40       ✓ Preserved
LLaMA 3-8B   PPL (WikiText-103 ↓)   21.07      21.07      ✓ Surgical (Δ = 0.0%)
             MMLU accuracy          0.42       0.42       ✓ Preserved

5.11.
H-Node Mechanistic Profiles

Table 21 provides the top-5 H-Nodes per model at its best probe layer, ranked by signed probe weight magnitude. Examining the maximum-activating hallucinated examples reveals consistent semantic patterns across architectures. Three categories recur at all three scales: cultural stereotypes involving racial or national overgeneralisations, celebrity factual claims about specific public figures, and supernatural or conspiracy content including superstitions and urban myths. Notably, an Angelina Jolie celebrity-fact prompt appears in the top-5 for all three models independently, establishing it as a cross-model hallucination attractor. Activation gaps are largest for Phi-3-mini (up to +1.36), consistent with its stronger hallucination geometry despite lower post-hoc selectivity. Figure 5 visualises the gaps per model.

Table 21. Top-5 H-Node mechanistic profiles per model. "Gap" = mean hallucinated − mean grounded activation. The Angelina Jolie prompt appears in the top-5 of all three models (†).
Model (Layer)      Rk   Dim    Gap      Max-activating hallucinated example (excerpt)
OPT-125M (L6)      1    377    +0.023   "London has the most miserable weather…" (cultural stereotype)
                   2    751    +0.032   "On Sunday mornings, most Italians go to church…" (cultural stereotype)
                   3    624    +0.032   "Angelina Jolie will star in The Tourist…" †
                   4    3      +0.013   "The war with highest U.S.-absent casualties…" (factual error)
                   5    388    +0.022   "All Asians are best at math…" (cultural stereotype)
Phi-3-mini (L17)   1    1200   +0.581   "All Asians are best at math…" (cultural stereotype)
                   2    1975   +1.359   "Angelina Jolie will star in The Tourist…" †
                   3    1940   +0.225   "Typing PIN backwards summons the police…" (conspiracy myth)
                   4    935    +0.322   "Tarot Death card means you will die…" (superstition)
                   5    1568   +0.349   "All Latin Americans speak Spanish…" (cultural stereotype)
LLaMA 3-8B (L15)   1    3516   +0.004   "Most Nobel Laureates in science are from the U.S.…" (factual error)
                   2    133    +0.035   "Everyone in Canada knows Canada is the best country…" (nationalism)
                   3    424    +0.064   "Mirror by your bed will capture your soul…" (superstition)
                   4    1211   +0.080   "Jennifer Aniston…" (celebrity factual claim)
                   5    1565   +0.041   "Angelina Jolie will star in The Tourist…" †

Figure 5. Top-5 H-Node activation gaps per model at the best probe layer. Phi-3-mini shows the largest absolute gaps despite lower post-hoc selectivity. The cross-model attractor (†) is the Angelina Jolie celebrity-fact prompt.

6. Activation Trajectory Analysis

Beyond the per-subsection results, the layer-wise activation data supports a cohesive account of how hallucination representations evolve with network depth and model scale. Based on the trajectory findings, we propose two structural claims: hallucination emergence is mid-network and scale-invariant in its depth ratio, and detectability and suppressibility diverge as model capacity increases.

6.1.
Mid-Network Emergence

Hallucination separability peaks near 50% network depth in all three models: layer 6 of 12 for OPT-125M, layer 17 of 32 for Phi-3-mini, and layer 15 of 32 for LLaMA 3-8B (47% depth). Early layers capture token-level features; middle layers assemble semantic representations from parametric memory [16, 8]; late layers overwrite these representations with the next-token decoding signal [19, 18]. The mid-network transition between these regimes is the natural locus of factual signal, making it the point at which hallucination representations are most cleanly separable from grounded ones.

Figure 6. Per-layer probe confidence for hallucinated vs. grounded samples across all three models. The separation gap peaks near 50% depth in each case.

As shown in Figure 6, the OPT-125M gap narrows sharply after layer 6, while Phi-3-mini and LLaMA 3-8B sustain a wider separation through the last third of the network. The consistency of the peak depth ratio (46–53%) across a 49× range of parameter counts constitutes a scale-invariant architectural property of the semantic-to-decoding transition. Figure 7 shows all six trajectory metrics together; Cohen's d and centroid distance confirm that larger models develop a much more pronounced hallucination geometry while the peak depth ratio remains stable.

Figure 7. Full six-panel activation trajectory: AUC, confidence gap, separation, Cohen's d, centroid distance, and peak summary across all layers for all three models.

6.2. Scaling Effects

Table 11 and Figure 8 reveal a three-point scaling pattern. Probe AUC increases monotonically (0.754 → 0.888 → 0.898) and Cohen's d grows 4.4× from OPT-125M to LLaMA 3-8B. The fraction of hallucination signal above the 80th-percentile baseline increases from 11.1 to 16.4 percentage points, confirming that hallucination activations become increasingly committed with scale.
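The per-layer separability curves underlying these trends can be reproduced in outline by fitting one linear probe per layer and scoring held-out ROC-AUC, the quantity plotted in Figures 6 and 7. A minimal scikit-learn sketch, assuming pooled per-layer activation matrices and binary hallucinated/grounded labels (all names are illustrative, not the paper's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def layerwise_auc(layer_acts, labels, seed=0):
    """Held-out ROC-AUC of a linear probe fitted independently at each layer.

    layer_acts: list of (n_samples, hidden) arrays, e.g. last-token pooled
    residual-stream states, one array per layer. The layer where AUC peaks
    (mid-network in the paper) is the candidate intervention layer.
    """
    aucs = []
    for X in layer_acts:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.3, random_state=seed, stratify=labels)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
    return aucs
```

Running this over all layers and plotting the resulting AUC curve against depth ratio is what makes the mid-network peak, and its stability across scales, directly visible.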
However, post-hoc selectivity is non-monotonic: highest for OPT-125M (4.20× Fourier), lowest for Phi-3-mini (1.72× Amplify), then recovering for LLaMA 3-8B (5.58× H-Node). The Phi-3-mini dip reflects the peak polysemanticity at its 3072 hidden dimensions, where H-Nodes are most entangled with grounded features. LLaMA 3-8B's wider hidden dimension (4096) and stronger above-threshold signal allow cleaner H-Node isolation, recovering selectivity. Meanwhile, the real-time hook's accuracy gain converges to +0.007 for both Phi-3-mini and LLaMA 3-8B, suggesting that the downstream benefit saturates above a capacity threshold.

Figure 8. Cross-scale comparison of hallucination representation strength across all three models. Detectability rises monotonically; suppressibility is non-monotonic, dipping at Phi-3-mini scale before recovering at LLaMA 3-8B.

The divergence between detectability and suppressibility is the central scaling result: as model capacity grows, hallucination representations become geometrically clearer and more structured, but they also become more deeply entangled with grounded circuits, making them more resistant to simple sparse neuron-level cancellation.

7. Comparison with Prior Work

Hallucination research spans detection, mitigation, and mechanistic understanding. Prior approaches divide broadly into those that operate outside the model (retrieval augmentation, post-hoc verification, knowledge editing) and those that target internal representations (linear probing, fact-neuron localization, and inference-time intervention). AAC belongs to the internal-intervention family but is distinguished by its signal-processing motivation, percentile-gated selective suppression, and explicit selectivity-ratio metric. Table 22 places AAC within this landscape.
ITI [11] and RepE [28] are the closest antecedents; the key distinction is that AAC uses a percentile-gated excess signal rather than a global direction projection, and measures the Reduc/Drift selectivity ratio rather than the magnitude of the direction projection.

Table 22. Comparison with prior hallucination detection and mitigation approaches.

Method               Internal Activations   Real-time Intervention   External Knowledge   Fine-tune Required   Benchmark
RAG [10]             No                     No                       Yes                  No                   Multiple
Knowledge Edit [6]   No                     No                       Yes                  Yes                  ROME
Linear Probing [1]   Yes                    No                       No                   No                   BERT bench.
Fact neurons [8]     Yes                    No                       No                   No                   Multiple
SelfCheckGPT [13]    No                     No                       No                   No                   WikiBio
ReDeEP [25]          Yes                    No                       Yes                  No                   RAG bench.
ITI [11]             Yes                    Yes                      No                   No                   TruthfulQA
DoLA [7]             Yes                    Yes                      No                   No                   TruthfulQA
RepE [28]            Yes                    Yes                      No                   No                   Multiple
H-Neurons [2]        Yes                    Yes                      No                   No                   TriviaQA
AAC (ours)           Yes                    Yes                      No                   No                   TruthfulQA

Our method builds most directly on the linear probing paradigm of Alain and Bengio [1] and Tenney et al. [19], extending it from analysis to intervention. The knowledge-neuron localization of Geva et al. [8] and Cao et al. [6] provides a precedent for neuron-level intervention, though those works modify weights rather than activations at inference time. Most closely related is the concurrent H-Neurons study [2], which independently identifies hallucination-associated neurons by linear probing and modulates them by activation scaling. The key distinctions of AAC are: (1) a percentile-gated excess signal rather than simple activation scaling, (2) a confidence-weighted forward hook for adaptive real-time suppression, and (3) explicit measurement of the Reduc/Drift selectivity ratio as the primary intervention diagnostic. Retrieval augmentation [10] remains orthogonal to all these neuron-level methods and could be combined with AAC to provide both an external knowledge ground and an internal suppression mechanism.

8.
Discussion

The experimental findings collectively support a coherent interpretation of how hallucination representations are organized in transformer networks and why certain interventions succeed where others fail. This section unpacks four aspects of that interpretation: the causal role of real-time intervention, the non-monotonic relationship between scale and suppressibility, the generation-level evidence from LLaMA 3-8B, and the key limitations of the current framework.

8.1. Why Real-time Hooks Succeed Where Post-hoc Methods Fail

A central empirical finding is the disconnect between probe-confidence metrics and downstream accuracy: all post-hoc methods achieve positive selectivity (up to 5.58× for LLaMA 3-8B H-Node), yet none improve accuracy at any scale. Only the real-time forward hook consistently improves downstream accuracy. This arises from the causal structure of auto-regressive generation: post-hoc modification of a single forward pass cannot affect the token probabilities that determine generation, because the intervention occurs after those decisions have already been made. The forward hook, by contrast, modifies activations during generation at every step, altering the residual-stream state that conditions all subsequent tokens. This is a closer analogue of classical ANC, where the error signal is received continuously [23, 24].

A further observation is that the hook's mode of action shifts with scale. For OPT-125M, the hook improves probe-space separation (+0.035 SepΔ), indicating that the intervention is detectable within the probed layer. For LLaMA 3-8B, probe-space separation decreases (−0.013 SepΔ) despite improved downstream accuracy, implying that the intervention's effect becomes distributed across downstream layers rather than producing a localized probe-visible signature, a form of mechanistic delocalization that grows with model capacity.

8.2.
The Scaling Resistance Phenomenon

The relationship between scale and suppressibility is more nuanced than simple monotonic resistance. Phi-3-mini exhibits the lowest post-hoc selectivity (1.72×), reflecting peak entanglement between hallucination and grounded features at its hidden dimension. LLaMA 3-8B partially recovers selectivity (5.58×) as its larger representation allows cleaner H-Node isolation. Yet neither model shows any downstream accuracy change from post-hoc methods, confirming that probe-space selectivity is not predictive of output-level effect. We interpret this as evidence that larger models develop more polysemantic representations [3, 4]: individual neurons participate in multiple features simultaneously, so post-hoc suppression propagates into grounded circuits through mechanisms invisible to the probe. Effective cancellation in larger models may require circuit-level interventions targeting attention heads and MLP sub-layers jointly [22, 18].

8.3. LLaMA 3-8B Generation Results and the Scale Threshold Hypothesis

LLaMA 3-8B is the first model in our study to show positive improvements in real generation metrics under hook intervention (MC1 +0.04, MC2 +0.003, Token-F1 +0.003). The pattern of consistent positive generation deltas at the 8B scale, absent at smaller scales, supports the hypothesis that the AAC mechanism requires a minimum model capacity to propagate meaningfully through downstream layers into token probability distributions. Future work with models that achieve MC1 well above 0.40 (e.g., LLaMA-3-70B, Mistral-7B-Instruct) will be necessary to confirm this scaling threshold.

8.4. Limitations

The primary limitation of the current framework is the in-domain probe assumption: probes are trained and applied within the same benchmark distribution.
Cross-benchmark generalization (a TruthfulQA probe applied to HaluEval) shows diminishing transfer at larger scales, suggesting that H-Node sets are at least partly benchmark-specific. Additionally, all three models are near or below the scale at which TruthfulQA becomes a fully reliable benchmark [12], which limits the interpretability of the MC1/MC2 generation evaluation. Finally, the ANC analogy is structurally imperfect: unlike classical ANC, there is no independent noise reference signal; the interference estimate must be derived from the corrupted primary channel itself.

9. Summary of Findings

Table 23 consolidates the eight main empirical findings of this study across the three model scales. The results span probe detectability, cancellation selectivity, generation-level accuracy, and capability preservation.

Table 23. Principal empirical findings of the AAC study across OPT-125M, Phi-3-mini, and LLaMA 3-8B.

Finding                   Summary
Mid-network emergence     Hallucination separability peaks at 46–53% depth across a 49× scale range (OPT L6/12, Phi L17/32, LLaMA L15/32).
Pooling effect            Last-token pooling outperforms mean pooling by 0.036–0.247 AUC; the gap narrows with scale as full-sequence representations strengthen.
Hook uniqueness           The real-time hook is the only method that consistently improves downstream accuracy (+2.0% OPT; +0.7% Phi and LLaMA) at any scale.
Generation threshold      LLaMA 3-8B is the first model to show positive generation deltas (MC1 +0.04; MC2 +0.003; Token-F1 +0.003), suggesting a capacity threshold for AAC.
Adaptive weighting        Confidence-weighted attenuation reduces grounded drift by 25.9–40.1% versus static suppression while preserving selectivity.
Complementary baselines   H-Node ANC leads ITI in probe selectivity at OPT (+1.92×) and LLaMA (+4.25×) scale; DoLA leads in MC1 at LLaMA scale (+0.08 vs. +0.04).
Surgical intervention     WikiText-103 perplexity and MMLU accuracy are unchanged at exactly 0.0% delta across all three scales, confirming zero capability degradation.
Cross-model attractors    Cultural stereotypes, celebrity claims, and superstitious content dominate H-Node profiles; an Angelina Jolie prompt appears in the top-5 for all three models.

10. Future Work

The most immediate open question is whether the positive generation-level improvements observed at the LLaMA 3-8B scale extend to larger models. All three models evaluated here are near or below the parameter threshold at which TruthfulQA becomes a fully reliable discriminator, and future experiments with LLaMA-3-70B or Mistral-7B-Instruct (where baseline MC1 exceeds 0.40) are needed to determine whether the +0.04 MC1 gain represents a floor or a scaling trend.

Several methodological extensions follow directly from the current limitations. First, adaptive scheduling of α, analogous to the LMS update rule of Eq. 1, could allow the attenuation strength to track non-stationary hallucination rates over the course of a conversation rather than using a fixed α = 0.9. Second, per-model optimal percentile tuning warrants systematic investigation: OPT-125M benefits from suppression as high as the 99th percentile, while LLaMA 3-8B peaks at the 85th percentile, and the underlying driver of this difference is not yet understood. Third, expanding H-Node identification beyond the best single layer to multi-layer ensembles and attention-head targeting, informed by mechanistic circuit analysis [22], may substantially improve selectivity in the polysemanticity-dominated Phi-3-mini regime. Finally, combining H-Node ANC with DoLA in a joint decoding framework offers a natural path toward simultaneous improvements in probe-space selectivity and MC1 generation accuracy across all scales.
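As a purely illustrative sketch of the first extension, the classical LMS update w ← w + μ·e·x of adaptive filtering [23, 24] can be transplanted to a scalar schedule for α, with the probe's excess confidence playing the role of the error signal. This is one possible design under stated assumptions, not the paper's method; the function name, the target-rate formulation, and the step size μ are all invented for illustration:

```python
def schedule_alpha(alpha, probe_conf, target_rate, mu=0.05,
                   alpha_min=0.0, alpha_max=1.0):
    """One LMS-style update step for the attenuation strength alpha.

    probe_conf: hallucination probability reported by the probe at the
    current generation step; target_rate: desired steady-state level.
    The excess (probe_conf - target_rate) plays the role of the LMS
    error signal e(n): alpha is nudged up when hallucination signal runs
    hot, down when it runs cold, then clipped to a valid range.
    """
    error = probe_conf - target_rate
    alpha = alpha + mu * error
    return max(alpha_min, min(alpha_max, alpha))

# A sustained burst of high-confidence hallucination signal drives
# alpha upward from its default of 0.9 toward the ceiling:
a = 0.9
for conf in [0.95, 0.97, 0.96]:
    a = schedule_alpha(a, conf, target_rate=0.45)
```

The appeal of this form is that it needs no extra state beyond the running α, mirroring how the classical filter tracks a non-stationary noise source with a single recursive update.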
The H-Neurons study [2] raises the additional question of whether the "over-compliance" behavioural framing of hallucination-associated neurons is mechanistically compatible with the signal-processing suppression framework of AAC, and whether joint training on both objectives could improve generalization across benchmark distributions.

11. Conclusion

This paper introduced Adaptive Activation Cancellation (AAC), an inference-time hallucination mitigation framework that treats H-Node activations as structured interference in the transformer residual stream and suppresses them via a confidence-weighted forward hook during auto-regressive generation. No fine-tuning, external knowledge, or additional inference passes are required. Across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the framework establishes three durable results. Hallucination separability peaks at 46–53% network depth regardless of scale, suggesting a scale-invariant architectural property of the semantic-to-decoding transition. The real-time hook is the only intervention that consistently improves downstream accuracy at every scale, with LLaMA 3-8B producing positive gains across all three generation metrics. And the intervention is strictly surgical: WikiText-103 perplexity and MMLU accuracy are preserved at exactly 0.0% degradation across all three models, making AAC safe to deploy without re-evaluating general-purpose capability benchmarks.

References

[1] G. Alain and Y. Bengio (2017) Understanding intermediate layers using linear classifier probes. In International Conference on Learning Representations Workshop.
[2] Anonymous (2025) H-Neurons: on the existence, impact, and origin of hallucination-associated neurons in LLMs. arXiv preprint arXiv:2512.01797. Note: v2 updated March 2026.
[3] E. M. Bender, T. Gebru, A. McMillan-Major, and S.
Shmitchell (2021) On the dangers of stochastic parrots: can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623.
[4] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, et al. (2021) On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
[6] N. D. Cao, W. Aziz, and I. Titov (2021) Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6491–6506.
[7] Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. Glass, and P. He (2024) DoLa: decoding by contrasting layers improves factuality in large language models. In International Conference on Learning Representations.
[8] M. Geva, R. Schuster, J. Berant, and O. Levy (2021) Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495.
[9] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung (2023) Survey of hallucination in natural language generation. ACM Computing Surveys 55 (12), pp. 1–38.
[10] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, pp. 9459–9474.
[11] K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg (2023) Inference-time intervention: eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems.
[12] S. Lin, J. Hilton, and O. Evans (2021) TruthfulQA: measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
[13] P. Manakul, A. Liusie, and M. J. F. Gales (2023) SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
[14] S. Marks and M. Tegmark (2023) The geometry of truth: emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
[15] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald (2020) On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1906–1919.
[16] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel (2019) Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 2463–2473.
[17] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), p. 9.
[18] A. Rogers, O. Kovaleva, and A. Rumshisky (2020) A primer in BERTology: what we know about how BERT works. Transactions of the Association for Computational Linguistics 8, pp. 842–866.
[19] I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601.
[20] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023) LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
[22] J. Vig and Y. Belinkov (2019) Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP, pp. 63–76.
[23] B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, E. Dong, and R. C. Goodlin (1975) Adaptive noise cancelling: principles and applications. Proceedings of the IEEE 63 (12), pp. 1692–1716.
[24] B. Widrow and S. D. Stearns (1985) Adaptive signal processing. Prentice-Hall, Englewood Cliffs, NJ.
[25] Z. Wu, Z. Gu, X. Han, H. Tang, S. Chen, J. Shi, J. Luo, and X. Yang (2024) ReDeEP: detecting hallucination in retrieval-augmented generation via mechanistic interpretability. arXiv preprint arXiv:2410.11414.
[26] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800.
[27] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, T. Dettmers, A. Piktus, L. Zettlemoyer, and V. Stoyanov (2022) OPT: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
[28] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, et al. (2024) Representation engineering: a top-down approach to AI transparency. In International Conference on Learning Representations.