Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Steering LLMs via Scalable Interactive Oversight | Hang Yan, Zhikai Lei, Long Ma, Qi Zhang | 2026 | Scalable Oversight | scalable-oversight, empirical, alignment-training, ai-safety | E5 / R3 (94%) | - |
| A Benchmark for Scalable Oversight Protocols | Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery | 2025 | Scalable Oversight | scalable-oversight, ai-safety, benchmark | E6 / R3 (95%) | 1 |
| A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring | Julian Schulz | 2025 | Scalable Oversight | theoretical, scalable-oversight, ai-safety | E7 / R3 (94%) | - |
| AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs | Juan Gustavo Corvalán, Francisca Gauna Selasco, Facundo Nieto, Gerardo I. Simari | 2025 | Scalable Oversight | scalable-oversight, empirical, alignment-training, ai-safety | E5 / R3 (93%) | - |
| An Alignment Safety Case Sketch Based on Debate | Geoffrey Irving, Jacob Pfau, Benjamin Hilton, Marie Davidsen Buhl | 2025 | Scalable Oversight | theoretical, scalable-oversight, alignment-training, ai-safety | E5 / R3 (94%) | 8 |
| Avoiding Obfuscation with Prover-Estimator Debate | Geoffrey Irving, Jonah Brown-Cohen, Georgios Piliouras | 2025 | Scalable Oversight | theoretical, scalable-oversight, ai-safety | E5 / R3 (94%) | 6 |
| Basic Legibility Protocols Improve Trusted Monitoring | Ashwin Sreevatsa, Sebastian Prasanna, Cody Rushing | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E6 / R4 (96%) | - |
| Chain-of-Thought Monitorability: A New and Fragile Opportunity for AI Safety | Aleksander Madry, Mark Chen, Rohin Shah, Marius Hobbhahn | 2025 | Scalable Oversight | scalable-oversight, ai-safety, position | E5 / R4 (95%) | 88 |
| Contrastive Weak-to-strong Generalization | Jiaxin Wu, Junfeng Fang, Houcheng Jiang, Yang Deng | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E6 / R3 (95%) | - |
| Control Tax: The Price of Keeping AI in Check | Caglar Gulcehre, Zhen Ning David Liu, Mikhail Terekhov, Samuel Albanie | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety, safety-evaluation | E5 / R3 (94%) | 3 |
| Debate Helps Weak-to-Strong Generalization | Yongbin Li, Fei Huang, Hao Lang | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E6 / R3 (93%) | 4 |
| Ensemble Debates with Local Large Language Models for AI Alignment | Ephraiem Sarabamoun | 2025 | Scalable Oversight | scalable-oversight, empirical, alignment-training, ai-safety | - | - |
| FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research | Gabriel Recchia, Chatrik Singh Mangat, Gayatri Krishnakumar, Issac Li | 2025 | Scalable Oversight | scalable-oversight, ai-safety, dataset | E6 / R4 (93%) | - |
| Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models | Zhenthing Qi, Lillian Sun, Himabindu Lakkaraju, Aounon Kumar | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E7 / R3 (93%) | 6 |
| Human-AI Complementarity: A Goal for Amplified Oversight | Rory Greig, Vladimir Mikulik, Tian Huey Teh, Sophie Bridgers | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E5 / R3 (93%) | 4 |
| Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity | Austin Meek, Iván Arcuschin, Eitan Sprejer, Austin J. Brockmeier | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E7 / R4 (97%) | 1 |
| Mechanistic Anomaly Detection for "Quirky" Language Models | Nora Belrose, David O. Johnston, Arkajyoti Chakraborty | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E5 / R3 (96%) | 2 |
| Monitoring Monitorability | Marcus Williams, Benjamin Arnav, Joost Huizinga, Micah Carroll | 2025 | Scalable Oversight | scalable-oversight, ai-safety, safety-evaluation, benchmark | E5 / R3 (95%) | 6 |
| On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective | Wei Yao, Yong Liu, Ziqiao Wang, Gengze Xu | 2025 | Scalable Oversight | theoretical, scalable-oversight, ai-safety | E4 / R3 (93%) | 1 |
| Reliable Weak-to-Strong Monitoring of LLM Agents | Chen Bo Calvin Zhang, Paula Rodriguez, Ankit Aich, Kevin Zhu | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety, red-teaming | E5 / R3 (95%) | 4 |
| Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions | Yihao Xue, Jiping Li, Baharan Mirzasoleiman | 2025 | Scalable Oversight | theoretical, scalable-oversight, ai-safety | E5 / R3 (94%) | 5 |
| Research Superalignment Should Advance Now with Alternating Competence and Conformity Optimization | JinYeong Bak, James Evans, Jing Yao, Xiaoyuan Yi | 2025 | Scalable Oversight | scalable-oversight, alignment-training, ai-safety, position | E6 / R3 (93%) | - |
| Safety Cases: A Scalable Approach to Frontier AI Safety | Geoffrey Irving, Tomek Korbak, Benjamin Hilton, Marie Davidsen Buhl | 2025 | Scalable Oversight | scalable-oversight, ai-safety, position | E5 / R3 (93%) | 10 |
| Scalable Human Oversight for Aligned Large Language Models: A Hybrid Framework for Intent Fidelity | Oluwasefunmi B. Famodimu, Folasade Y. Ayankoya, Olubukola D. Adekola, Shade O. Kuyoro | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E4 / R2 (95%) | - |
| Scalable Oversight for Superhuman AI via Recursive Self-Critiquing | Xueru Wen, Yaojie Lu, Jie Lou, XingYu | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E5 / R3 (97%) | 4 |
| Scalable Oversight via Partitioned Human Supervision | Takashi Ishida, Masashi Sugiyama, Ren Yin | 2025 | Scalable Oversight | theoretical, scalable-oversight, ai-safety | E5 / R3 (96%) | - |
| Scaling Laws For Scalable Oversight | David D. Baek, Joshua Engels, Subhash Kantamneni, Max Tegmark | 2025 | Scalable Oversight | theoretical, scalable-oversight, ai-safety | E7 / R3 (97%) | 4 |
| Super Co-alignment of Human and AI for Sustainable Symbiotic Society | Boyuan Chen, Lei Wang, Dongqi Liang, Yuwei Wang | 2025 | Scalable Oversight | scalable-oversight, alignment-training, ai-safety, position | E5 / R3 (94%) | 3 |
| Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection | Yongqiang Chen, Bo Han, Gang Niu, James Cheng | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety | E5 / R3 (96%) | - |
| Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction | Cameron Allen, Micah Carroll, Tianyi Alex Qiu | 2025 | Scalable Oversight | scalable-oversight, empirical, ai-safety, safety-evaluation | E5 / R3 (95%) | - |
Showing 30 of 66 papers on page 1.