
Paper deep dive

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

Mohammad Beigi, Ming Jin, Junshan Zhang, Qifan Wang, Lifu Huang

Year: 2025 · Venue: arXiv preprint · Area: Deception & Failure · Type: Empirical · Embeddings: 48

Models: Llama-2-7B

Abstract

Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines: reducing sycophancy to near-SFT levels while improving helpfulness, decreasing verbosity while achieving the highest ROUGE-L, and suppressing code gaming while improving Pass@1. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize across domains -- a Hacker trained on code gaming exhibits increased sycophancy despite no reward for this behavior, and an Auditor trained on one domain effectively suppresses exploitation in others, enabling efficient multi-domain defense with a single model.

Tags

adversarial-robustness (suggested, 80%) · ai-safety (imported, 100%) · deception-failure (suggested, 92%) · empirical (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/11/2026, 12:40:56 AM

Summary

The paper introduces Adversarial Reward Auditing (ARA), a two-stage framework to mitigate reward hacking in RLHF. In Stage 1, a Hacker policy and an Auditor network are trained in a competitive game to discover and detect reward model vulnerabilities. In Stage 2, the trained Auditor is used to gate reward signals during RLHF, penalizing exploitative responses. Experiments show ARA effectively reduces sycophancy, length bias, and code gaming while maintaining utility, with evidence that both hacking and mitigation strategies generalize across domains.
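The Stage 2 gating rule is given in the paper as R_gated(x,y) = R_θ(x,y) · A_ξ(h_{x,y})^γ with γ > 0 controlling gating severity. A minimal sketch of that rule; γ = 2.0 and the example scores are illustrative placeholders, not the paper's hyperparameters:

```python
def gated_reward(proxy_reward, genuine_prob, gamma=2.0):
    """AG-RLHF reward gating (sketch): suppress the proxy reward in
    proportion to detected exploitation. genuine_prob is the Auditor's
    estimate P(genuine | h_{x,y}); gamma=2.0 is an illustrative choice."""
    return proxy_reward * genuine_prob ** gamma

# A response the Auditor deems genuine keeps most of its proxy reward,
# while a detected exploit is suppressed toward zero.
high = gated_reward(0.9, 0.95)
low = gated_reward(0.9, 0.10)
```

With γ = 2, a response scored 0.95 "genuine" retains about 90% of its proxy reward, while one scored 0.10 retains only 1%.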

Entities (5)

Adversarial Reward Auditing · framework · 100%
Auditor-Guided RLHF · method · 100%
Reward Hacking · problem · 100%
Code Gaming · failure-mode · 95%
Sycophancy · failure-mode · 95%

Relation Signals (3)

Adversarial Reward Auditing mitigates Reward Hacking

confidence 100% · ARA mitigates reward hacking in RLHF by making exploitation detectable and thus unprofitable.

Auditor-Guided RLHF uses Auditor

confidence 100% · AG-RLHF gates the reward signal based on the Auditor’s exploitation probability

Reward Hacking manifests as Sycophancy

confidence 90% · Known manifestations of reward hacking include... sycophancy that prioritizes agreement over accuracy

Cypher Suggestions (2)

Find all failure modes mitigated by the ARA framework · confidence 95% · unvalidated

MATCH (f:Framework {name: 'Adversarial Reward Auditing'})-[:MITIGATES]->(p:Problem) RETURN p.name

Identify components of the ARA framework · confidence 90% · unvalidated

MATCH (a:Framework {name: 'Adversarial Reward Auditing'})-[:HAS_COMPONENT]->(c) RETURN c.name

Full Text

47,918 characters extracted from source content.


Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

Mohammad Beigi, Ming Jin, Junshan Zhang, Qifan Wang, Lifu Huang

Abstract

Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines: reducing sycophancy to near-SFT levels while improving helpfulness, decreasing verbosity while achieving the highest ROUGE-L, and suppressing code gaming while improving Pass@1. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize across domains—a Hacker trained on code gaming exhibits increased sycophancy despite no reward for this behavior, and an Auditor trained on one domain effectively suppresses exploitation in others, enabling efficient multi-domain defense with a single model.

1 Introduction

Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) has become a dominant paradigm for aligning Large Language Models (LLMs) with human preferences.
In typical RLHF pipelines, a reward model (RM) is first trained from human-ranked response pairs to approximate preference judgments, and the LLM policy is then optimized to maximize this learned reward. However, since the reward models are usually imperfect proxies for true human intent, this pipeline suffers from a fundamental vulnerability: reward hacking, where models exploit spurious correlations in the reward model to achieve high scores while violating genuine human preferences (Gao et al., 2023; Geirhos et al., 2020; Amodei et al., 2016; Everitt et al., 2021; Liu et al., 2024). Known manifestations of reward hacking include length bias (i.e., overly verbose but inaccurate generations) (Chen et al., 2024; Gao et al., 2023), “gaming” in coding tasks (e.g., modifying unit tests rather than solving problems) (Baker et al., 2025; MacDiarmid et al., 2025), and sycophancy that prioritizes agreement over accuracy (Denison et al., 2024; Beigi et al., 2025). Importantly, recent work (MacDiarmid et al., 2025) demonstrates that reward hacking may generalize beyond isolated settings: once a model learns to exploit a proxy reward in one domain, it can acquire more strategic misaligned behaviors, such as appearing compliant under monitoring while pursuing misaligned objectives when unobserved. 
There has been a growing body of work that seeks to mitigate reward hacking, broadly categorized into three directions: (1) Regularization methods, which constrain optimization through KL divergence penalties (Chen et al., 2024), or information-theoretic constraints that filter task-irrelevant features (Miao et al., 2024); (2) Reward model improvements, such as scaling reward models (Gao et al., 2023), employing ensemble methods for more robust preference estimation (Liu et al., 2024; Eisenstein et al., 2023), and reward clipping (Engstrom et al., 2020; Zheng et al., 2023; Chen et al., 2024); and (3) Bias-specific interventions, which directly penalize known artifacts such as excessive response length (Singhal et al., 2024; Chen et al., 2024). While these strategies mitigate specific failure modes, they share a fundamental limitation: they are largely static defenses against an inherently dynamic problem. In particular, they provide no explicit mechanism to audit whether a high reward reflects genuine task quality or exploitation of spurious correlations in the reward model. Furthermore, these approaches can only suppress previously identified hacking patterns, but cannot detect or adapt to sophisticated, emergent forms of reward exploitation that arise during optimization. Motivated by this perspective, we propose Adversarial Reward Auditing (ARA), a framework that formulates reward hacking as a competitive two-player game between a Hacker and an Auditor. As illustrated in Figure 1, ARA operates in two stages: In Stage 1 (Hacker-Auditor Game), we train two competing components against a frozen reward model. The Hacker is a language model (initialized from supervised fine-tuning) that learns to generate responses with high proxy reward through exploitation. The Auditor is a classifier that operates on the reward model’s internal representations and learns to distinguish genuinely high-quality responses from exploitative ones. 
These components are trained adversarially: the Hacker is incentivized to evade detection while maximizing reward, and the Auditor continually updates to recognize the Hacker’s evolving strategies. This co-evolution exposes the Auditor to hard, adaptive adversarial examples, preventing overfitting to simple hacking patterns. In Stage 2 (Auditor-Guided RLHF (AG-RLHF)), we deploy the trained Auditor to guide standard RLHF training. Rather than using the proxy reward directly, AG-RLHF gates the reward signal based on the Auditor’s exploitation probability: responses flagged as exploitative receive suppressed rewards, making hacking strategies unprofitable. This transforms reward hacking from an unobservable failure into a measurable signal that actively shapes policy optimization. This two-stage design provides two key benefits. First, the competitive dynamics create a self-improving mechanism where the Hacker continuously exposes reward model vulnerabilities, providing exactly the challenging cases needed to strengthen detection. Second, unlike static defenses against predetermined failure modes, our framework actively discovers emerging forms of reward hacking, adapting to novel exploitation strategies as they develop during training. Building on these advantages, ARA mitigates reward hacking in RLHF by making exploitation detectable and thus unprofitable. Figure 1: Adversarial Reward Auditing (ARA) trains a Hacker to exploit a frozen proxy reward model while an Auditor learns to detect reward-hacked outputs. The Auditor then gates the reward used for RL, making exploitation detectable and unprofitable. 
We conduct extensive experiments across three reward hacking scenarios, including sycophancy (Sharma et al., 2025; Denison et al., 2024), length bias (Chen et al., 2024; Gao et al., 2023), and code gaming (MacDiarmid et al., 2025; Baker et al., 2025), demonstrating that ARA consistently achieves the best alignment-utility tradeoff, reducing hacking while maintaining task performance compared to all existing strong baselines. Furthermore, our analysis reveals that reward hacking generalizes across domains: exploits learned in one setting transfer to others. A Hacker trained only on code gaming exhibits 22.5% higher sycophancy despite receiving no reward for this behavior, with cross-domain transfer accelerating precisely when in-domain exploitation saturates. Critically, we find that ARA's detection and mitigation capabilities also generalize: the Auditor's ability to detect and suppress exploits transfers across domains.

Our contributions are summarized as follows:

• We propose ARA, the first framework that formulates reward hacking as a competitive two-player game, jointly training a Hacker to discover reward exploits and an Auditor to detect them, transforming reward hacking from an unobservable failure into a measurable, controllable signal.

• We demonstrate through extensive experiments on multiple benchmarks that our framework detects and significantly mitigates reward hacking compared to existing methods, achieving better alignment with human preferences.

• We show that reward hacking, detection, and mitigation all generalize across domains: a Hacker trained on one domain exhibits exploitation in others, and an Auditor trained on a single domain effectively suppresses hacking across multiple domains.

2 Related Works

Reward Hacking in Large Language Models.
Reward hacking, which presents a prominent challenge in alignment and safe deployment of AI systems, occurs when policy models exploit proxy reward models to achieve high scores without fulfilling true human objectives (Gao et al., 2023; Geirhos et al., 2020; Amodei et al., 2016; Everitt et al., 2021; Wen et al., 2024). This stems from reward misgeneralization, where RMs trained on finite preference datasets serve as imperfect proxies for human preference, incorrectly associating rewards with spurious features that correlate with training preferences but not actual quality (Gao et al., 2023; Liu et al., 2024; Skalse et al., 2025). As policies optimize these flawed proxies, proxy metrics increase while true alignment deteriorates (Skalse et al., 2025; Miao et al., 2024; Amodei et al., 2016). Common known manifestations include length bias (Chen et al., 2024), sycophancy (Denison et al., 2024; Beigi et al., 2025), and coding exploitation (Baker et al., 2025). As model capabilities advance, reward hacking evolves from passive exploitation to sophisticated strategic behavior, necessitating adaptive rather than static solutions (Taylor et al., 2025; Denison et al., 2024; MacDiarmid et al., 2025).

Mitigating Reward Hacking in Large Language Models. Current mitigation approaches treat reward hacking as an optimization problem to constrain rather than an adversarial challenge to counter. Regularization-based methods rely primarily on KL divergence penalties (Chen et al., 2024) and information-theoretic constraints that filter task-irrelevant features (Miao et al., 2024). Reward model improvements through scaling (Gao et al., 2023) and ensembles (Eisenstein et al., 2023) attempt to create more robust reward signals, yet scaling up network size or quantity presents limited feasibility and may incur significant costs, especially for models with billions of parameters.
Targeted debiasing approaches (Singhal et al., 2024; Chen et al., 2024) address specific known biases such as response length, but remain limited to predetermined failure modes and cannot generalize to novel exploitation strategies. Our approach is distinct from existing methods since it specifically targets reward misgeneralization through an adversarial auditor that actively probes for and exposes spurious correlations as they emerge during training. Furthermore, it provides automatic detection of whether high-scoring responses achieve their rewards through genuine quality or exploitation, transforming reward hacking from an unobservable failure into a measurable signal.

3 ARA: Adversarial Reward Auditing

We operate within the standard RLHF setting. Given an input prompt x and a model-generated response y, a reward model R_θ(x,y) assigns a scalar score intended to reflect how well y aligns with human preferences for x. In practice, R_θ is generally trained on a human-preference dataset D_pref = {(x^(i), y_w^(i), y_l^(i))}_{i=1}^{N}, where y_w and y_l denote preferred and dispreferred responses for the same prompt x. The reward model thus serves as a learned proxy for an inaccessible ground-truth preference function R^*. RLHF then optimizes a policy π_φ, initialized from a supervised fine-tuned model π_SFT, via

J_RLHF(φ) = E_{x∼D, y∼π_φ(·|x)}[ R_θ(x,y) − β D_KL(π_φ ∥ π_SFT) ].

Reward hacking occurs when π_φ exploits misspecifications in R_θ such that R_θ(x,y) increases while R^*(x,y) decreases. We introduce Adversarial Reward Auditing (ARA), a framework that formulates reward hacking as a competitive two-player game.
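The KL-regularized objective J_RLHF is, in many common RLHF implementations, realized by folding the KL term into the scalar reward per sample. A minimal sketch under that common convention (β = 0.1 is an illustrative value, not the paper's setting):

```python
def kl_shaped_reward(proxy_reward, logp_policy, logp_sft, beta=0.1):
    """KL-regularized reward, sketched per-sample (common RLHF practice,
    not necessarily the paper's exact implementation).

    The objective subtracts beta * D_KL(pi_phi || pi_SFT); per sample this
    becomes beta * (log pi_phi(y|x) - log pi_SFT(y|x)) subtracted from the
    proxy reward, penalizing drift away from the SFT model.
    """
    return proxy_reward - beta * (logp_policy - logp_sft)

# A response the policy strongly prefers over the SFT model is penalized.
r = kl_shaped_reward(1.0, logp_policy=-2.0, logp_sft=-5.0)
```

Here the policy assigns the response much higher log-probability than the SFT model (−2.0 vs. −5.0), so the reward is reduced by β · 3.0.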
ARA operates in two stages: Stage 1 (Hacker-Auditor Game), conducted prior to policy optimization, where we train a Hacker to discover exploits in the frozen reward model R_θ while an Auditor learns to detect these exploits from the reward model's internal representations. The Hacker is initialized from the same supervised fine-tuned model π_SFT that will later be optimized in Stage 2, ensuring it explores the same exploitation strategies the policy would naturally discover during RLHF. Once the Auditor is trained, we proceed to Stage 2 (Auditor-Guided RLHF), where the trained Auditor gates reward signals during standard policy optimization: responses flagged as exploitative receive suppressed rewards, making hacking strategies unprofitable. Throughout both stages, the reward model R_θ remains frozen, ensuring the Auditor is calibrated to a fixed feature space.

3.1 Stage 1: The Hacker-Auditor Game

We formulate an adversarial game between a Hacker policy H_ψ and an Auditor network A_ξ. The Hacker, initialized from π_SFT, learns to generate responses that achieve high proxy rewards by exploiting spurious correlations in R_θ. The Auditor learns to distinguish genuinely aligned responses from exploitative ones.

3.1.1 The Auditor

The Auditor A_ξ is designed to distinguish genuinely aligned responses from those that exploit R_θ. Since both may yield high rewards, discrimination requires analyzing the internal mechanism of reward generation rather than reward values alone. Our key insight is that exploitation manifests distinctly in the reward model's latent space: models trained on finite preference data inevitably encode both task-relevant features and spurious correlates, and exploitative responses activate the latter disproportionately. Let R_θ decompose into a feature extractor f_θ and a reward head, and define h_{x,y} = f_θ(x,y) ∈ R^d as the penultimate-layer activation.
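To make the auditor head concrete, here is a toy sketch of a map A_ξ: R^d → [0,1] as a one-hidden-layer MLP in pure Python; the dimensions and weights are illustrative placeholders, not the paper's architecture:

```python
import math

def mlp_auditor(h, W1, b1, w2, b2):
    """Score a reward-model feature vector h (sketch).

    One hidden layer with ReLU, then a sigmoid output interpreted as
    P(genuine | h). All weights here are hand-picked toy values.
    """
    hidden = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
              for row, b in zip(W1, b1)]
    logit = sum(w * z for w, z in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

# Toy 3-dimensional feature vector and illustrative weights.
h = [0.5, -1.0, 2.0]
W1 = [[0.2, 0.1, -0.3], [0.0, 0.4, 0.5]]
b1 = [0.0, 0.1]
w2 = [1.0, -1.5]
b2 = 0.2
p_genuine = mlp_auditor(h, W1, b1, w2, b2)
```

In the paper this head is trained on the frozen reward model's penultimate activations; the sketch only fixes the input/output contract.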
The Auditor is a multilayer perceptron A_ξ: R^d → [0,1] that takes h_{x,y} as input and estimates P(genuine | h_{x,y}), the probability that response y reflects genuine alignment rather than exploitation. Training the Auditor requires labeled examples of both classes. Positive samples D^+ = {(x_i, y_i^w) ∈ D_pref} consist of preferred responses from the original preference dataset, representing the ground-truth distribution that R_θ was trained to model. Negative samples D^- are drawn from two sources: responses generated by the current Hacker H_ψ, and historical exploits stored in a replay buffer B. While both D^+ and D^- contain high-reward responses, they achieve reward through different activation patterns: D^+ activates features aligned with human judgment, whereas D^- increasingly activates spurious features as the Hacker learns to exploit. This activation-space divergence is initially small but grows throughout training, providing the Auditor with progressively stronger signal. We maintain B as a fixed-size buffer, adding only Hacker responses that achieve high proxy reward (R_θ > τ_R) and successfully evade the current Auditor (A_ξ > τ_A). When the buffer reaches capacity, we remove examples that the Auditor now easily detects, keeping the buffer focused on hard negatives. During training, we sample negatives as a mixture: a fraction (1 − α) from the current Hacker and α from B, with buffer samples prioritized by Auditor confidence to focus on cases the Auditor still fails to detect. To prevent the Auditor from learning superficial heuristics rather than exploitation signatures, all prompts for D^- are sampled from the same distribution as D^+. The Auditor minimizes

L_A(ξ; ψ) = L_BCE(ξ) + λ_C L_Con(ξ),

where the dependence on ψ arises because D^- contains samples from the current Hacker H_ψ.
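The buffer admission/eviction rule just described can be sketched minimally; the thresholds, capacity, and the auditor stub below are illustrative placeholders:

```python
def update_buffer(buffer, candidates, auditor, tau_r=0.8, tau_a=0.5, capacity=4):
    """Replay-buffer maintenance (sketch of the rule in the paper).

    Admit Hacker responses with high proxy reward (R > tau_r) that also
    evade the current Auditor (genuine-probability > tau_a); once over
    capacity, evict the entries the Auditor now detects most easily
    (lowest genuine-probability), keeping the buffer full of hard negatives.
    """
    for resp, proxy_reward in candidates:
        if proxy_reward > tau_r and auditor(resp) > tau_a:
            buffer.append(resp)
    while len(buffer) > capacity:
        easiest = min(buffer, key=auditor)  # most easily detected exploit
        buffer.remove(easiest)
    return buffer

# Hypothetical stand-in: a "response" is (name, genuine_prob) and the
# auditor just reads off the stored probability.
auditor = lambda resp: resp[1]
buf = [("old1", 0.6), ("old2", 0.1)]
cands = [(("new1", 0.9), 0.95),   # high reward, evades auditor -> admitted
         (("new2", 0.7), 0.5),    # reward below tau_r -> rejected
         (("new3", 0.8), 0.9),    # admitted
         (("new4", 0.55), 0.85)]  # admitted
buf = update_buffer(buf, cands, auditor)
```

After the update, the easily detected "old2" is evicted and the buffer holds the four hardest negatives.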
The binary cross-entropy loss is

L_BCE(ξ) = −E_{(x,y^+)∼D^+}[ log A_ξ(h_{x,y^+}) ] − E_{(x,y^-)∼D^-}[ log(1 − A_ξ(h_{x,y^-})) ].

The supervised contrastive loss operates on normalized embeddings z_i = g_ξ(h_{x_i,y_i}) / ‖g_ξ(h_{x_i,y_i})‖ from the Auditor's penultimate layer g_ξ, which transforms the reward model representations h = f_θ(x,y) into a space optimized for exploitation detection:

L_Con(ξ) = Σ_{i∈B} (−1/|S(i)|) Σ_{s∈S(i)} log [ exp(z_i · z_s / τ) / Σ_{a≠i} exp(z_i · z_a / τ) ],

where B is a mini-batch, S(i) = {s ∈ B : s ≠ i, c_s = c_i} is the set of samples sharing the same class label c_i ∈ {genuine, exploit} as sample i, and τ is a temperature hyperparameter. The contrastive term encourages geometrically structured representations where genuine responses cluster together and exploits cluster separately, improving robustness to the Hacker's evolving strategies.

3.1.2 The Hacker

The Hacker H_ψ, initialized from π_SFT, is trained to discover exploits in the reward model by maximizing proxy reward while evading Auditor detection. We optimize

J_H(ψ; ξ) = E_{x∼D, y∼H_ψ(·|x)}[ R_θ(x,y) + λ_A log A_ξ*(h_{x,y}) − β_H D_KL(H_ψ ∥ π_SFT) ]

using Proximal Policy Optimization (PPO), where R_θ(x,y) + λ_A log A_ξ*(h_{x,y}) serves as the reward signal. The three terms serve complementary purposes. The proxy reward R_θ(x,y) incentivizes the Hacker to find high-reward responses, including those that exploit spurious correlations. The evasion term λ_A log A_ξ*(h_{x,y}) penalizes responses that the Auditor identifies as exploitative: since A_ξ* ∈ [0,1] estimates the probability of genuine alignment, log A_ξ* is negative when the Auditor suspects exploitation and approaches zero for responses deemed genuine.
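The supervised contrastive term L_Con can be sketched directly from its definition; the embeddings and labels below are toy values chosen so that same-class samples are either clustered or shuffled:

```python
import math

def supcon_loss(z, labels, tau=0.5):
    """Supervised contrastive loss over normalized embeddings (sketch).

    For each anchor i, positives S(i) are the other samples with the same
    label; the denominator sums exp-similarities over all samples a != i.
    """
    n = len(z)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for i in range(n):
        pos = [s for s in range(n) if s != i and labels[s] == labels[i]]
        if not pos:
            continue
        denom = sum(math.exp(dot(z[i], z[a]) / tau) for a in range(n) if a != i)
        loss += -sum(math.log(math.exp(dot(z[i], z[s]) / tau) / denom)
                     for s in pos) / len(pos)
    return loss

# Two "genuine" and two "exploit" unit vectors: tight same-class clusters
# should yield a lower loss than a shuffled labeling of the same points.
z = [[1.0, 0.0], [0.99, 0.141], [0.0, 1.0], [0.141, 0.99]]
tight = supcon_loss(z, ["genuine", "genuine", "exploit", "exploit"])
mixed = supcon_loss(z, ["genuine", "exploit", "genuine", "exploit"])
```

The loss rewards exactly the geometry the paper describes: genuine and exploit embeddings forming separate clusters.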
The KL regularization β_H D_KL(H_ψ ∥ π_SFT) constrains the Hacker to remain within a plausible language distribution, preventing degenerate outputs that trivially achieve high reward. The target Auditor A_ξ* is a Polyak-averaged copy of the Auditor, updated as ξ* ← ρ ξ* + (1 − ρ) ξ after each Auditor update with ρ = 0.995. This exponential moving average provides a stable optimization target: the Hacker optimizes against a slowly evolving Auditor rather than the current one, preventing oscillatory dynamics where the Hacker overfits to transient Auditor states.

3.1.3 Game Dynamics and Stabilization

The coupled optimization (RL for the Hacker, supervised learning for the Auditor) requires careful stabilization: if the Auditor advances too quickly, the Hacker receives uninformative gradients; if too slowly, it fails to detect emerging exploits. We employ three mechanisms to balance learning dynamics.

Two-Phase Update Schedule. We adopt a two-phase schedule to balance learning speeds. In Phase 1 (warmup), for the first T_warm steps, we update the Auditor at fixed intervals (once every K Hacker PPO updates), allowing both players to establish initial strategies. In Phase 2 (confidence-gated), we switch to adaptive updates: let Ā denote the moving average of Auditor confidence on recent Hacker outputs over a window of m steps. We update the Auditor only when Ā > τ_gate, indicating the Hacker is successfully evading detection; otherwise, the Auditor remains frozen to let the Hacker catch up. This gating prevents the Auditor from overpowering the Hacker prematurely.

Replay Buffer. We maintain a buffer B of historical exploits, sampling negatives as a mixture: a fraction (1 − α) from the current Hacker and α from B. This implements a form of fictitious play: the Auditor learns to detect the average historical strategy, not just the current one, preventing catastrophic forgetting of past vulnerabilities.

Target Network.
The Hacker optimizes against a Polyak-averaged target Auditor A_ξ*, updated as ξ* ← ρ ξ* + (1 − ρ) ξ with ρ = 0.995. This smooths the Hacker's learning signal, preventing overfitting to transient Auditor states.

3.2 Stage 2: Auditor-Guided RLHF (AG-RLHF)

Once Stage 1 produces a trained Auditor A_ξ that reliably distinguishes genuine quality from exploitation, we deploy it to guide standard RLHF training. Rather than optimizing the proxy reward R_θ directly, AG-RLHF gates the reward signal based on the Auditor's confidence that a response reflects genuine alignment:

R_gated(x,y) = R_θ(x,y) · A_ξ(h_{x,y})^γ,

where γ > 0 controls gating severity. When the Auditor is confident that a response is genuine (A_ξ ≈ 1), the gated reward approaches the full proxy reward. When the Auditor detects exploitation (A_ξ ≈ 0), the gated reward is suppressed toward zero regardless of how high the proxy reward is. The exponent γ modulates this tradeoff: higher values impose stricter penalties on suspected exploits, while lower values are more permissive. The policy π_φ, initialized from π_SFT, is optimized via

J_AG-RLHF(φ) = E_{x∼D, y∼π_φ(·|x)}[ R_gated(x,y) − β D_KL(π_φ ∥ π_SFT) ]

using PPO. The Auditor remains frozen throughout Stage 2. This separation is deliberate: the Auditor was trained adversarially against a Hacker explicitly optimizing for evasion, making it robust to the milder exploitation pressures that arise during standard policy optimization. Freezing also ensures stable reward signals: if the Auditor continued adapting, the policy would chase a moving target, destabilizing training. We provide the details of hyperparameter selection in Appendix A.1.

4 Experimental Setup

Tasks and Datasets.
To comprehensively evaluate ARA, we cover three known reward hacking scenarios. (1) Sycophancy: we train the reward model on Anthropic HH-RLHF (Bai et al., 2022) and measure Sycophancy Rate on SycophancyEval (Sharma et al., 2025), which tests whether models maintain correct answers when users express disagreement. (2) Length Bias: we follow the setting of Chen et al. (2024) and measure average response length and ROUGE-L. (3) Code Gaming: we construct a coding environment that enables reward hacking behaviors using the unit test dataset from (?); models receive problems with visible test cases and can either (i) implement general solutions or (ii) exploit test visibility by hardcoding outputs or manipulating assertions. We measure alignment through GPT-4-classified Gaming Rate (fraction of solutions exploiting rather than solving) and utility through Pass@1 on held-out tests invisible during training.

Baselines. We comprehensively compare against methods spanning four mitigation categories: (1) Unmitigated SFT: supervised fine-tuned model without RLHF (pre-optimization baseline). (2) PPO-KL Regularization: explicit KL penalty constraining drift from SFT (Chen et al., 2024; Miao et al., 2024). (3) ODIN (Chen et al., 2024): orthogonal disentanglement via dual reward heads. (4) InFoRM (Miao et al., 2024): variational information bottleneck filtering task-irrelevant features. (5) RM Ensemble: averages n independently trained reward models to reduce exploitable variance (Eisenstein et al., 2023; Miao et al., 2024). (6) Training Interventions: removes high-reward outliers (R > μ + 2σ) following the scaling-law findings of Gao et al. (2023). The SFT model (Llama-2-7B (Touvron et al., 2023)) serves as the baseline policy π_0. Each experiment is repeated with three random seeds; we report mean ± 95% confidence intervals.
5 Results and Discussion

5.1 Reward Hacking Mitigation with ARA

Table 1: Main Results. Comparison of AG-RLHF against baselines across three reward hacking scenarios. Columns are grouped by scenario: sycophancy (Sycophancy, Helpfulness), length bias (Avg. Length, ROUGE-L), and code gaming (Gaming Rate, Pass@1). Results averaged over 3 seeds.

Method | Sycophancy ↓ | Helpfulness ↑ | Avg. Length ↓ | ROUGE-L ↑ | Gaming Rate ↓ | Pass@1 ↑
SFT (No RLHF) | 36.2 | 41.3 | 148 | 21.4 | 4.2 | 28.5
Unmitigated PPO | 72.4 | 76.8 | 347 | 23.1 | 61.3 | 34.2
Regularization Methods:
PPO w/ KL | 58.3 | 68.2 | 268 | 22.8 | 48.5 | 31.8
ODIN | 51.6 | 63.4 | 195 | 23.4 | 42.1 | 33.5
InFoRM | 47.2 | 62.1 | 208 | 23.2 | 39.8 | 33.1
Reward Model Improvements:
RM Ensemble | 52.8 | 64.6 | 224 | 23.0 | 44.3 | 34.0
Training Interventions:
Filtering (R > μ + 2σ) | 55.1 | 50.3 | 241 | 22.6 | 46.7 | 32.4
ARA (Ours) | 38.4 | 77.2 | 162 | 24.1 | 19.6 | 35.8

Figure 2: ARA mitigates Goodhart's Law during RLHF optimization. Proxy reward (solid) is the learned reward model score; gold reward (dashed) measures true task-specific performance: GPT-4 factual accuracy for sycophancy (given the golden answer), ROUGE-L for length bias, and Pass@1 on held-out tests for code gaming.

Table 1 summarizes results across all three hacking scenarios, where ARA consistently achieves the best alignment-utility tradeoff. (1) For sycophancy, standard PPO increases the rate from 36.2% (SFT) to 72.4% while improving helpfulness by 35.5%, confirming that optimizing the proxy reward improves surface metrics at the cost of true alignment. Regularization methods (PPO-KL, ODIN, InFoRM) reduce sycophancy to 47-58% but sacrifice helpfulness, whereas ARA significantly reduces sycophancy to 38.4% while improving helpfulness to 77.2%. (2) For length bias, PPO inflates response length from 148 to 347 tokens (+134%) with minimal ROUGE-L improvement (+1.7 points), a clear signature of verbosity exploitation.
ODIN and InFoRM, designed specifically for this setting, achieve 195–208 tokens, while ARA reduces length to 162 tokens and achieves the highest ROUGE-L (24.1), indicating that the Auditor learned to distinguish verbose padding from substantive content. (3) Code gaming presents the most sophisticated exploitation, where models manipulate unit tests rather than solving problems. PPO achieves a 61.3% gaming rate, and existing methods only reduce this to 40–47%. ARA achieves a 19.6% gaming rate while improving Pass@1 to 35.8%, demonstrating that suppressing exploitation redirects optimization toward genuine problem-solving.

Goodhart Divergence Prevention. Figure 2 visualizes the training dynamics of proxy and task-specific performance across all three scenarios. Standard PPO exhibits the classic Goodhart pattern: proxy reward increases throughout training while gold reward peaks early and then declines, indicating that continued optimization actively degrades true performance. The timing of this divergence varies by scenario. Code gaming shows the earliest divergence at around 1,000 steps because test-gaming strategies are discovered and exploited quickly, leading to earlier saturation. Length bias diverges around step 3,500, reflecting a period where increased length initially correlates with quality before verbosity dominates. Sycophancy diverges latest at step 4,000, suggesting that agreement-seeking behavior emerges more gradually. In contrast, ARA maintains substantially better alignment between proxy and gold rewards throughout training, suggesting that Auditor-gated rewards successfully redirect optimization pressure toward genuine quality improvements rather than reward exploitation.

5.2 Hacker's Reward Hacking Dynamics

Figure 3 compares the temporal dynamics of three reward hacking behaviors, each measured during training on its respective task.
During steps 0–1,500, all three metrics remain near baseline as models learn basic task competence before discovering exploitable patterns. Beginning at step 1,500, all three hacking behaviors start to emerge: code gaming rises most steeply, while chat sycophancy and length bias follow similar trajectories. This synchronized onset across independent training runs suggests that reward hacking emerges at a consistent phase of optimization, regardless of the specific exploit type. The critical transition occurs around step 4,000, where code gaming saturates at approximately 59%, having exhausted available exploits, while chat sycophancy and length bias continue climbing. The parallel trajectories across distinct hacking scenarios suggest a common underlying mechanism: models first acquire task competence, then discover exploitation strategies at a characteristic optimization phase, with saturation rates varying by the complexity of available exploits.

Figure 3: Reward hacking emerges across tasks.

5.3 Auditor's Performance Evaluation

A core aspect of our framework is that the Auditor can distinguish whether high reward scores reflect genuine quality or exploitation. The critical test is whether the Auditor succeeds where proxy reward fails: distinguishing genuinely high-quality responses from high-reward exploits. We construct a balanced evaluation set of 500 genuine responses (from D^+) and 500 exploitative responses (from known hacked responses), all filtered to have comparable high proxy rewards (R_θ > 0.85). As shown in Table 2, we compare three detection approaches. The learned reward model cannot distinguish the two groups: by construction, both achieve similarly high scores (0.91 vs. 0.89, Δ = 0.02). We also prompt GPT-4 to classify whether each response is genuine or exploitative based on the text alone, outputting a score from 0 (likely exploit) to 1 (likely genuine).
GPT-4 achieves moderate separation (Δ = 0.44), confirming that exploits are detectable from surface features but not trivially so. The Auditor achieves the strongest separation (Δ = 0.63): on average, it assigns probability 0.82 to genuine responses and only 0.19 to exploits, indicating that the Auditor is highly confident in distinguishing the two groups.

Detection Method     Genuine   Exploit   Δ
Proxy Reward (R_θ)   0.91      0.89      0.02
GPT-4                0.78      0.34      0.44
Auditor Confidence   0.82      0.19      0.63

Table 2: Genuine vs. exploitative response detection. Mean scores (probability) for 500 genuine responses and 500 high-reward exploits, all with comparable proxy rewards (R_θ > 0.85).

5.4 Cross-Domain Generalization of Reward Hacking

5.4.1 Reward Model Misalignment Structure

Figure 4 visualizes reward model representations h(x, y) for exploitative responses across three hacking types using t-SNE. The visualization reveals unexpected structure: chat sycophancy and length bias occupy nearly identical regions of representation space, while code gaming forms a distinct but proximate cluster. This geometry suggests that sycophancy and verbosity exploit similar features in the reward model—perhaps both manipulate surface-level indicators of helpfulness or engagement—while code gaming employs a mechanistically different strategy involving test-case manipulation.

Figure 4: Representation structure of exploitation types. t-SNE of reward model hidden states h(x, y) for exploitative responses.

5.4.2 Hacker Generalization

We examine whether the Hacker exhibits emergent misalignment patterns across known hacks. We train the Hacker on a single domain until convergence, then evaluate its hacking rate on all three domains without any additional training. Table 3 shows hacking rates when the Hacker is trained on one domain and evaluated on others.
The off-diagonal entries reveal that reward hacking does not remain domain-specific: training on code gaming alone increases chat sycophancy from 36.2% to 58.7% (+22.5 points), despite the Hacker never being rewarded for sycophantic behavior or trained on chat datasets. The reverse also holds—training in a hackable chat environment increases code gaming from 4.2% to 28.3%—confirming bidirectional transfer between semantically distinct exploitation strategies. Among single-domain conditions, code gaming produces the strongest cross-domain generalization (58.7% on chat, 31.4% on summarization). Training on all three domains amplifies these effects: combined hacking rates exceed any single-domain condition, suggesting that diverse exploitation experience compounds rather than competes, accelerating the acquisition of general reward-seeking behavior. This observation aligns with recent findings from MacDiarmid et al. (2025), who demonstrate that reward hacking in production RL systems leads to emergent misalignment that generalizes beyond the original exploitation context.

Train Domain   Sycophancy   Length Bias   Code Gaming
None (SFT)     36.2         8.1           4.2
Sycophancy     72.4         35.2          28.3
Length Bias    41.8         52.3          22.6
Code Gaming    58.7         31.4          61.3
All Three      68.9         78.1          59.7

Table 3: Cross-Domain Generalization of Hacking. Hacking rate (%) when the Hacker is trained on one source domain (rows) and evaluated on each domain (columns). Diagonal entries show in-domain performance.

Figure 5: Gating severity sensitivity analysis. Hacking metrics (red, lower is better) and utility metrics (blue, higher is better).

5.4.3 Auditor's Reward Hacking Detection Generalization

Table 4 reports the Auditor's detection AUC for both in-domain performance and cross-domain transfer. The transfer behavior closely mirrors the underlying representational geometry.
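Detection AUC, the metric reported in Table 4, can be computed from ranked Auditor scores without any library dependency. A minimal sketch, assuming binary labels (1 = exploit, 0 = genuine) and per-response exploit probabilities; the toy data is illustrative:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank interpretation: the
    probability that a randomly chosen positive (exploit) is scored
    above a randomly chosen negative (genuine), with ties counted half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy Auditor scores: exploits mostly, but not always, ranked above
# genuine responses, so AUC falls between 0.5 (chance) and 1.0.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(auc(labels, scores))
```

Cross-domain transfer then amounts to scoring a test set from one domain with an Auditor trained on another and reading off this same statistic, which is how a matrix like Table 4 is filled in.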
An Auditor trained on chat sycophancy attains 0.77 AUC when transferred to length-bias detection—approaching its in-domain performance—whereas transfer to code gaming is markedly weaker (0.58 AUC). The pattern is symmetric: an Auditor trained on length bias achieves 0.75 AUC on chat sycophancy but only 0.55 on code gaming. In contrast, an Auditor trained on code gaming exhibits only modest transfer to both chat and length (0.59–0.62 AUC), consistent with code gaming occupying a more distinct region of representation space. These findings have practical implications. When sycophancy or verbosity are the primary concerns, training on either domain provides reasonable coverage of both. By comparison, mitigating code gaming likely requires dedicated training data, as its exploitation mechanism is sufficiently distinct that cross-domain generalization is limited.

Train Domain   Sycophancy   Length Bias   Code Gaming
Sycophancy     0.82         0.77         0.58
Length Bias    0.75         0.79         0.55
Code Gaming    0.62         0.59         0.84

Table 4: Auditor Transfer Matrix. Detection AUC when training the Auditor on one domain (rows) and testing on others (columns).

5.4.4 ARA's Reward Hacking Mitigation Generalization

We further evaluate the cross-domain generalization of reward-hacking mitigation through ARA's two-stage training procedure. As shown in Table 5, the transfer behavior closely mirrors the Auditor detection results in Table 4. Sycophancy and length bias exhibit strong bidirectional transfer: ARA trained on sycophancy reduces response length from 347 to 178 tokens (a 49% reduction), while ARA trained on length bias reduces sycophancy from 72.4% to 42.1%. Transfer involving code gaming is weaker but still meaningful: ARA trained on sycophancy lowers the gaming rate from 61.3% to 48.2%, and ARA trained on code gaming reduces sycophancy from 72.4% to 54.8%.
Although these cross-domain gains fall short of in-domain performance, they provide substantial mitigation relative to unmitigated PPO, indicating partial overlap in exploitation signatures even across mechanistically distinct hacking strategies. Finally, training on all three domains yields near-optimal mitigation across all metrics (39.2% sycophancy, 165 tokens, 21.4% gaming rate), with only marginal degradation compared to domain-specific training.

Train Domain   Syco. (%) ↓   Length (tokens) ↓   Code Gaming (%) ↓
PPO            72.4          347                 61.3
Sycophancy     38.4          178                 48.2
Length Bias    42.1          162                 51.6
Code Gaming    54.8          246                 19.6
All Three      39.2          165                 21.4

Table 5: Cross-Domain Mitigation Transfer. Hacking metrics when ARA is trained on one domain (rows) and deployed on others (columns). Diagonal entries show in-domain performance.

5.5 AG-RLHF Gating Severity

Figure 5 shows performance across gating severity γ ∈ [0, 8], where γ = 0 recovers standard PPO. Sycophancy and length bias achieve optimal utility at γ = 2, while code gaming requires stronger suppression at γ = 3, likely because test manipulation represents more severe exploitation. Notably, utility improves as γ increases from 0 to the optimum, confirming that suppressing spurious shortcuts redirects optimization toward genuine task performance rather than merely trading off against it.

6 Ablation Studies

We examine how Auditor capacity affects detection and mitigation by varying model size from 5M to 85M parameters. Different exploitation types require different capacities: length bias saturates at 25M, sycophancy at 35M, and code gaming at 50M—reflecting increasing complexity from surface-level features to subtle test-manipulation patterns. Beyond these optimal points, larger Auditors provide only marginal improvements (details in Appendix A.2.1).
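The gating severity γ of Section 5.5 can be sketched as a simple reward transformation. A minimal illustration, assuming the Auditor emits an exploit probability p and the gate subtracts a γ-scaled penalty above a threshold τ; this particular functional form is an assumption for illustration, not necessarily the paper's exact gate:

```python
def gated_reward(proxy_reward, p_exploit, gamma=2.0, tau=0.5):
    """Illustrative AG-RLHF-style gate: pass the proxy reward through
    unchanged when the Auditor sees no exploitation, and subtract a
    gamma-scaled penalty once exploit probability exceeds tau."""
    penalty = gamma * max(0.0, p_exploit - tau)
    return proxy_reward - penalty

# gamma = 0 recovers standard PPO (no gating), matching Figure 5.
print(gated_reward(0.9, 0.9, gamma=0.0))  # ungated high-reward exploit
print(gated_reward(0.9, 0.9, gamma=2.0))  # same exploit, heavily penalized
print(gated_reward(0.9, 0.1, gamma=2.0))  # genuine response left untouched
```

This framing makes the γ sweep intuitive: small γ lets detected exploits keep most of their reward, while very large γ risks penalizing false positives, which is consistent with utility peaking at an intermediate γ.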
7 Conclusion

We introduced Adversarial Reward Auditing (ARA), a framework that formulates reward hacking as a competitive game between a Hacker that discovers exploits and an Auditor that detects them, transforming exploitation from an unobservable failure into a measurable signal. Experiments across sycophancy, length bias, and code gaming demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines, reducing hacking while improving task performance. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize across domains, enabling efficient multi-domain defense with a single Auditor. These findings suggest that adversarial auditing offers a promising direction for building RLHF systems robust to novel exploitation strategies.

Impact Statement

This work aims to improve AI safety by detecting and mitigating reward hacking in RLHF systems. While the Hacker component discovers exploits, these vulnerabilities are already documented in prior work, and our primary contribution is the defensive Auditor framework. We believe the benefits of understanding and mitigating reward hacking outweigh potential dual-use concerns.

References

(1) meta-llama/Llama-2-7b · Hugging Face. https://huggingface.co/meta-llama/Llama-2-7b. [Accessed 02-01-2026].

Amodei et al. (2016) Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety, 2016. URL https://arxiv.org/abs/1606.06565.

Bai et al. (2022) Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

Baker et al. (2025) Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., Zaremba, W., Pachocki, J., and Farhi, D.
Monitoring reasoning models for misbehavior and the risks of promoting obfuscation, 2025. URL https://arxiv.org/abs/2503.11926.

Beigi et al. (2025) Beigi, M., Shen, Y., Shojaee, P., Wang, Q., Wang, Z., Reddy, C. K., Jin, M., and Huang, L. Sycophancy mitigation through reinforcement learning with uncertainty-aware adaptive reasoning trajectories. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 13090–13103, 2025.

Chen et al. (2024) Chen, L., Zhu, C., Soselia, D., Chen, J., Zhou, T., Goldstein, T., Huang, H., Shoeybi, M., and Catanzaro, B. ODIN: Disentangled reward mitigates hacking in RLHF. arXiv preprint arXiv:2402.07319, 2024.

Denison et al. (2024) Denison, C., MacDiarmid, M., Barez, F., Duvenaud, D., Kravec, S., Marks, S., Schiefer, N., Soklaski, R., Tamkin, A., Kaplan, J., Shlegeris, B., Bowman, S. R., Perez, E., and Hubinger, E. Sycophancy to subterfuge: Investigating reward-tampering in large language models, 2024. URL https://arxiv.org/abs/2406.10162.

Eisenstein et al. (2023) Eisenstein, J., Nagpal, C., Agarwal, A., Beirami, A., D'Amour, A., Dvijotham, D., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., et al. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint arXiv:2312.09244, 2023.

Engstrom et al. (2020) Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Implementation matters in deep policy gradients: A case study on PPO and TRPO, 2020. URL https://arxiv.org/abs/2005.12729.

Everitt et al. (2021) Everitt, T., Hutter, M., Kumar, R., and Krakovna, V. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, 2021. URL https://arxiv.org/abs/1908.04734.

Gao et al. (2023) Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866. PMLR, 2023.

Geirhos et al.
(2020) Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, November 2020. ISSN 2522-5839. doi: 10.1038/s42256-020-00257-z. URL http://dx.doi.org/10.1038/s42256-020-00257-z.

Liu et al. (2024) Liu, T., Xiong, W., Ren, J., Chen, L., Wu, J., Joshi, R., Gao, Y., Shen, J., Qin, Z., Yu, T., et al. RRM: Robust reward model training mitigates reward hacking. arXiv preprint arXiv:2409.13156, 2024.

MacDiarmid et al. (2025) MacDiarmid, M., Wright, B., Uesato, J., Benton, J., Kutasov, J., Price, S., Bouscal, N., Bowman, S., Bricken, T., Cloud, A., Denison, C., Gasteiger, J., Greenblatt, R., Leike, J., Lindsey, J., Mikulik, V., Perez, E., Rodrigues, A., Thomas, D., Webson, A., Ziegler, D., and Hubinger, E. Natural emergent misalignment from reward hacking in production RL, 2025. URL https://arxiv.org/abs/2511.18397.

Miao et al. (2024) Miao, Y., Zhang, S., Ding, L., Bao, R., Zhang, L., and Tao, D. InFoRM: Mitigating reward hacking in RLHF via information-theoretic reward modeling, 2024. URL https://arxiv.org/abs/2402.09345.

Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Sharma et al. (2025) Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., and Perez, E. Towards understanding sycophancy in language models, 2025. URL https://arxiv.org/abs/2310.13548.

Singhal et al. (2024) Singhal, P., Goyal, T., Xu, J., and Durrett, G. A long way to go: Investigating length correlations in RLHF, 2024. URL https://arxiv.org/abs/2310.03716.
Skalse et al. (2025) Skalse, J., Howe, N. H. R., Krasheninnikov, D., and Krueger, D. Defining and characterizing reward hacking, 2025. URL https://arxiv.org/abs/2209.13085.

Taylor et al. (2025) Taylor, M., Chua, J., Betley, J., Treutlein, J., and Evans, O. School of reward hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs, 2025. URL https://arxiv.org/abs/2508.17511.

Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Wen et al. (2024) Wen, J., Zhong, R., Khan, A., Perez, E., Steinhardt, J., Huang, M., Bowman, S. R., He, H., and Feng, S. Language models learn to mislead humans via RLHF, 2024. URL https://arxiv.org/abs/2409.12822.

Zheng et al. (2023) Zheng, R., Dou, S., Gao, S., Hua, Y., Shen, W., Wang, B., Liu, Y., Jin, S., Liu, Q., Zhou, Y., Xiong, L., Chen, L., Xi, Z., Xu, N., Lai, W., Zhu, M., Chang, C., Yin, Z., Weng, R., Cheng, W., Huang, H., Sun, T., Yan, H., Gui, T., Zhang, Q., Qiu, X., and Huang, X. Secrets of RLHF in large language models Part I: PPO, 2023. URL https://arxiv.org/abs/2307.04964.

Appendix A

A.1 Implementation Details

Table 6: Hyperparameters.
Component        Hyperparameter                     Value
Hacker (PPO)     Learning rate                      1 × 10^-6
                 Batch size                         128
                 PPO epochs                         4
                 Clip ratio ε                       0.2
                 Evasion weight λ_A                 0.5
                 KL penalty β_H                     0.1
Auditor (MLP)    Input dimension                    4,096
                 Hidden layers                      3–4
                 Hidden dimension                   2,048–4,096
                 Activation                         GELU
                 Parameters                         25M–50M
Game Dynamics    Polyak coefficient ρ               0.995
                 Hacker updates per cycle K         5
                 Auditor updates per cycle M        1
                 Warmup steps T_warm                200
                 Evasion threshold τ_evade          0.5
                 Replay buffer size |B|             1,000
AG-RLHF          Learning rate                      5 × 10^-7
                 KL penalty β                       0.05
                 Gating severity γ (Syco./Length)   2.0
                 Gating severity γ (Code)           3.0
Infrastructure   GPUs                               8 × A40 (48GB)
                 Training time                      ~72 hours
                 Random seeds                       3

A.2 Ablation Study

A.2.1 Auditor Size

Figure 6 shows how Auditor capacity affects both detection and mitigation across all three scenarios. Under-capacity Auditors (5M–10M) achieve weak detection AUC (0.65–0.76) and correspondingly poor mitigation, leaving substantial hacking behavior. Interestingly, different exploitation types require different Auditor capacities: length bias achieves strong performance at 25M, sycophancy requires 35M, while code gaming benefits from additional capacity up to 50M before saturating. This ordering reflects exploitation complexity—length bias involves surface-level features easily captured by smaller models, whereas code gaming requires detecting more subtle test-manipulation patterns. Our chosen configurations (marked in gold) provide strong detection (0.79–0.84 AUC) and effective mitigation across all tasks. Larger Auditors (85M, shaded region) provide marginal improvements (<2%) despite additional parameters, indicating diminishing returns.

Figure 6: Auditor size ablation across all scenarios. Top row: detection AUC. Bottom row: mitigation performance (red = hacking metric ↓, blue = utility metric ↑). Gold markers indicate the optimal configuration for each task.
Different tasks require different capacities: length bias saturates at 25M, sycophancy at 35M, and code gaming at 50M, reflecting increasing exploitation complexity.
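The Auditor's parameter budget in Table 6 follows directly from its MLP shape. A minimal sketch, assuming a plain fully connected stack (4,096-dim input → hidden layers → scalar exploit score); the exact head and any extra components (e.g. layer norms) are assumptions, so the counts below only approximate the paper's 25M–50M range:

```python
def mlp_param_count(input_dim, hidden_dim, n_hidden, output_dim=1):
    """Parameter count of a plain MLP: weights plus biases for the input
    projection, the hidden-to-hidden layers, and the output head."""
    dims = [input_dim] + [hidden_dim] * n_hidden + [output_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# Table 6's range: 3-4 hidden layers of width 2,048-4,096 on a 4,096-dim
# reward-model hidden state. These plain-MLP counts land near the stated
# budget (~17M and ~50M); the paper's exact 25M configuration presumably
# differs in depth, width, or extra components.
small = mlp_param_count(4096, 2048, 3)
large = mlp_param_count(4096, 4096, 3)
print(small, large)
```

This kind of back-of-envelope count is useful when reproducing the size ablation: it pins down which (depth, width) pairs actually realize the 5M–85M sweep.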