Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

PaperIntel
Steering LLMs via Scalable Interactive Oversight

Hang Yan, Zhikai Lei, Long Ma, Qi Zhang

Year: 2026 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, empirical, alignment-training, ai-safety

E5 / R3 (94%)
A Benchmark for Scalable Oversight Protocols

Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery

Year: 2025 | Area: Scalable Oversight | Citations: 1

Tags: scalable-oversight, ai-safety, benchmark

E6 / R3 (95%)
A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

Julian Schulz

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: theoretical, scalable-oversight, ai-safety

E7 / R3 (94%)
AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

Juan Gustavo Corvalán, Francisca Gauna Selasco, Facundo Nieto, Gerardo I. Simari

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, empirical, alignment-training, ai-safety

E5 / R3 (93%)
An Alignment Safety Case Sketch Based on Debate

Geoffrey Irving, Jacob Pfau, Benjamin Hilton, Marie Davidsen Buhl

Year: 2025 | Area: Scalable Oversight | Citations: 8

Tags: theoretical, scalable-oversight, alignment-training, ai-safety

E5 / R3 (94%)
Avoiding Obfuscation with Prover-Estimator Debate

Geoffrey Irving, Jonah Brown-Cohen, Georgios Piliouras

Year: 2025 | Area: Scalable Oversight | Citations: 6

Tags: theoretical, scalable-oversight, ai-safety

E5 / R3 (94%)
Basic Legibility Protocols Improve Trusted Monitoring

Ashwin Sreevatsa, Sebastian Prasanna, Cody Rushing

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, empirical, ai-safety

E6 / R4 (96%)
Chain-of-Thought Monitorability: A New and Fragile Opportunity for AI Safety

Aleksander Madry, Mark Chen, Rohin Shah, Marius Hobbhahn

Year: 2025 | Area: Scalable Oversight | Citations: 88

Tags: scalable-oversight, ai-safety, position

E5 / R4 (95%)
Contrastive Weak-to-strong Generalization

Jiaxin Wu, Junfeng Fang, Houcheng Jiang, Yang Deng

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, empirical, ai-safety

E6 / R3 (95%)
Control Tax: The Price of Keeping AI in Check

Caglar Gulcehre, Zhen Ning David Liu, Mikhail Terekhov, Samuel Albanie

Year: 2025 | Area: Scalable Oversight | Citations: 3

Tags: scalable-oversight, empirical, ai-safety, safety-evaluation

E5 / R3 (94%)
Debate Helps Weak-to-Strong Generalization

Yongbin Li, Fei Huang, Hao Lang

Year: 2025 | Area: Scalable Oversight | Citations: 4

Tags: scalable-oversight, empirical, ai-safety

E6 / R3 (93%)
Ensemble Debates with Local Large Language Models for AI Alignment

Ephraiem Sarabamoun

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, empirical, alignment-training, ai-safety

-
FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research

Gabriel Recchia, Chatrik Singh Mangat, Gayatri Krishnakumar, Issac Li

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, ai-safety, dataset

E6 / R4 (93%)
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models

Zhenting Qi, Lillian Sun, Himabindu Lakkaraju, Aounon Kumar

Year: 2025 | Area: Scalable Oversight | Citations: 6

Tags: scalable-oversight, empirical, ai-safety

E7 / R3 (93%)
Human-AI Complementarity: A Goal for Amplified Oversight

Rory Greig, Vladimir Mikulik, Tian Huey Teh, Sophie Bridgers

Year: 2025 | Area: Scalable Oversight | Citations: 4

Tags: scalable-oversight, empirical, ai-safety

E5 / R3 (93%)
Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

Austin Meek, Iván Arcuschin, Eitan Sprejer, Austin J. Brockmeier

Year: 2025 | Area: Scalable Oversight | Citations: 1

Tags: scalable-oversight, empirical, ai-safety

E7 / R4 (97%)
Mechanistic Anomaly Detection for "Quirky" Language Models

Nora Belrose, David O. Johnston, Arkajyoti Chakraborty

Year: 2025 | Area: Scalable Oversight | Citations: 2

Tags: scalable-oversight, empirical, ai-safety

E5 / R3 (96%)
Monitoring Monitorability

Marcus Williams, Benjamin Arnav, Joost Huizinga, Micah Carroll

Year: 2025 | Area: Scalable Oversight | Citations: 6

Tags: scalable-oversight, ai-safety, safety-evaluation, benchmark

E5 / R3 (95%)
On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective

Wei Yao, Yong Liu, Ziqiao Wang, Gengze Xu

Year: 2025 | Area: Scalable Oversight | Citations: 1

Tags: theoretical, scalable-oversight, ai-safety

E4 / R3 (93%)
Reliable Weak-to-Strong Monitoring of LLM Agents

Chen Bo Calvin Zhang, Paula Rodriguez, Ankit Aich, Kevin Zhu

Year: 2025 | Area: Scalable Oversight | Citations: 4

Tags: scalable-oversight, empirical, ai-safety, red-teaming

E5 / R3 (95%)
Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions

Yihao Xue, Jiping Li, Baharan Mirzasoleiman

Year: 2025 | Area: Scalable Oversight | Citations: 5

Tags: theoretical, scalable-oversight, ai-safety

E5 / R3 (94%)
Research Superalignment Should Advance Now with Alternating Competence and Conformity Optimization

JinYeong Bak, James Evans, Jing Yao, Xiaoyuan Yi

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, alignment-training, ai-safety, position

E6 / R3 (93%)
Safety Cases: A Scalable Approach to Frontier AI Safety

Geoffrey Irving, Tomek Korbak, Benjamin Hilton, Marie Davidsen Buhl

Year: 2025 | Area: Scalable Oversight | Citations: 10

Tags: scalable-oversight, ai-safety, position

E5 / R3 (93%)
Scalable Human Oversight for Aligned Large Language Models: A Hybrid Framework for Intent Fidelity

Oluwasefunmi B. Famodimu, Folasade Y. Ayankoya, Olubukola D. Adekola, Shade O. Kuyoro

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, empirical, ai-safety

E4 / R2 (95%)
Scalable Oversight for Superhuman AI via Recursive Self-Critiquing

Xueru Wen, Yaojie Lu, Jie Lou, Xing Yu

Year: 2025 | Area: Scalable Oversight | Citations: 4

Tags: scalable-oversight, empirical, ai-safety

E5 / R3 (97%)
Scalable Oversight via Partitioned Human Supervision

Takashi Ishida, Masashi Sugiyama, Ren Yin

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: theoretical, scalable-oversight, ai-safety

E5 / R3 (96%)
Scaling Laws For Scalable Oversight

David D. Baek, Joshua Engels, Subhash Kantamneni, Max Tegmark

Year: 2025 | Area: Scalable Oversight | Citations: 4

Tags: theoretical, scalable-oversight, ai-safety

E7 / R3 (97%)
Super Co-alignment of Human and AI for Sustainable Symbiotic Society

Boyuan Chen, Lei Wang, Dongqi Liang, Yuwei Wang

Year: 2025 | Area: Scalable Oversight | Citations: 3

Tags: scalable-oversight, alignment-training, ai-safety, position

E5 / R3 (94%)
Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection

Yongqiang Chen, Bo Han, Gang Niu, James Cheng

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, empirical, ai-safety

E5 / R3 (96%)
Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction

Cameron Allen, Micah Carroll, Tianyi Alex Qiu

Year: 2025 | Area: Scalable Oversight | Citations: -

Tags: scalable-oversight, empirical, ai-safety, safety-evaluation

E5 / R3 (95%)

Showing 30 of 66 papers on page 1.