Paper deep dive

Mapping Technical Safety Research at AI Companies: A literature review and incentives analysis

Oscar Delaney, Oliver Guest, Zoe Williams

Year: 2024Venue: arXiv preprintArea: Surveys & ReviewsType: SurveyEmbeddings: 95

Models: GPT-4o-mini

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 5:44:46 PM

Summary

This report analyzes technical safety research conducted by Anthropic, Google DeepMind, and OpenAI between 2022 and 2024. It categorizes 80 relevant papers into nine safety approaches, identifying that corporate research is heavily concentrated in enhancing human feedback and mechanistic interpretability, while critical areas like multi-agent safety and safety-by-design remain under-researched.

Entities (6)

Anthropic · ai-company · 100%Google DeepMind · ai-company · 100%OpenAI · ai-company · 100%Enhancing human feedback · research-approach · 95%Mechanistic interpretability · research-approach · 95%Safety by design · research-approach · 95%

Relation Signals (3)

Anthropic → conductsresearchin → Enhancing human feedback

confidence 90% · The report analyzes technical research into safe AI development being conducted by three leading AI companies: Anthropic, Google DeepMind, and OpenAI.

Google DeepMind → conductsresearchin → Mechanistic interpretability

confidence 90% · The report categorizes 80 included papers into nine safety approaches, including mechanistic interpretability.

OpenAI → lacksresearchin → Multi-agent safety

confidence 90% · We identified three categories where there are currently no or few papers... These are model organisms of misalignment, multi-agent safety, and safety by design.

Cypher Suggestions (2)

Find all research approaches and the companies associated with them. · confidence 90% · unvalidated

MATCH (c:Company)-[:CONDUCTS_RESEARCH_IN]->(a:Approach) RETURN c.name, a.name

Identify research approaches with zero or low publication volume. · confidence 85% · unvalidated

MATCH (a:Approach) WHERE a.paper_count < 5 RETURN a.name, a.paper_count

Abstract

Abstract:As AI systems become more advanced, concerns about large-scale risks from misuse or accidents have grown. This report analyzes the technical research into safe AI development being conducted by three leading AI companies: Anthropic, Google DeepMind, and OpenAI. We define safe AI development as developing AI systems that are unlikely to pose large-scale misuse or accident risks. This encompasses a range of technical approaches aimed at ensuring AI systems behave as intended and do not cause unintended harm, even as they are made more capable and autonomous. We analyzed all papers published by the three companies from January 2022 to July 2024 that were relevant to safe AI development, and categorized the 80 included papers into nine safety approaches. Additionally, we noted two categories representing nascent approaches explored by academia and civil society, but not currently represented in any research papers by these leading AI companies. Our analysis reveals where corporate attention is concentrated and where potential gaps lie. Some AI research may stay unpublished for good reasons, such as to not inform adversaries about the details of security techniques they would need to overcome to misuse AI systems. Therefore, we also considered the incentives that AI companies have to research each approach, regardless of how much work they have published on the topic. We identified three categories where there are currently no or few papers and where we do not expect AI companies to become much more incentivized to pursue this research in the future. These are model organisms of misalignment, multi-agent safety, and safety by design. Our findings provide an indication that these approaches may be slow to progress without funding or efforts from government, civil society, philanthropists, or academia.

PDF

Open source PDF →

PDF not stored locally. Use the link above to view on the source site.

Full Text

94,982 characters extracted from source content.

Expand or collapse full text

September2024 MappingTechnical SafetyResearchatAI Companies Aliteraturereviewandincentivesanalysis AUTHORS OscarDelaney–ResearchAssistant OliverGuest–ResearchAnalyst ZoeWilliams–ActingCo-Director 1 ExecutiveSummary Asartificialintelligence(AI)systemsbecomemoreadvanced,concernsaboutlarge-scalerisks frommisuseoraccidentshavegrown.ThisreportanalyzesthetechnicalresearchintosafeAI developmentbeingconductedbythreeleadingAIcompanies:Anthropic,GoogleDeepMind, andOpenAI. Wedefine“safeAIdevelopment”asdevelopingAIsystemsthatareunlikelytoposelarge-scale misuseoraccidentrisks.Thisencompassesarangeoftechnicalapproachesaimedatensuring AIsystemsbehaveasintendedanddonotcauseunintendedharm,evenastheyaremade morecapableandautonomous. WeanalyzedallpaperspublishedbythethreecompaniesfromJanuary2022toJuly2024that wererelevanttosafeAIdevelopment,andcategorizedthe80includedpapersintoninesafety approaches. 1 Additionally,wenotedtwocategoriesrepresentingnascentapproachesexplored byacademiaandcivilsociety,butnotcurrentlyrepresentedinanyresearchpapersbythese leadingAIcompanies.Ouranalysisrevealswherecorporateattentionisconcentratedand wherepotentialgapslie(Table1). SomeAIresearchmaystayunpublishedforgoodreasons,suchastonotinformadversaries aboutthedetailsofsafetyandsecuritytechniquestheywouldneedtoovercometomisuseAI systems.Therefore,wealsoconsideredtheincentivesthatAIcompanieshavetoresearcheach approach,regardlessofhowmuchworktheyhavepublishedonthetopic.Inparticular,we consideredreputationaleffects,regulatoryburdens,andtowhatextenttheapproachescould beusedtomakethecompany’sAIsystemsmoreuseful. Weidentifiedthreecategorieswheretherearecurrentlynoorfewpapersandwherewedonot expectAIcompaniestobecomemuchmoreincentivizedtopursuethisresearchinthefuture. Thesearemodelorganismsofmisalignment,multi-agentsafety,andsafetybydesign.Our findingsprovideanindicationthattheseapproachesmaybeslowtoprogresswithoutfunding oreffortsfromgovernment,civilsociety,philanthropists,oracademia. 1 Becauseofissueswithourcode,anearlierversionofourresearchincorrectlyexcludedsomerelevantpapers.We apologizefortheerror.Ourcorrectedpaperwasfinaliz edonSeptember25 th 2024andtheoriginalversioncanstill beaccessedforarchivalpurposesonarXiv.TheissuesaredescribedinmoredetailonourGitHubrepository. MappingTechnicalSafetyResearchatAICompanies |2 Table1:AmountofpublicresearchindifferentcategoriesfromleadingAIcompanies AreaProportion ofrelevant papers 2 Enhancinghumanfeedback DevelopingbetterwaysofincorporatinghumanpreferenceswhentrainingadvancedAI modelsincaseswherepeoplemightstruggletogiveadequatefeedbackonAIoutputs. 39% Mechanisticinterpretability Developingtoolstoconvertmodelweightsintousefulhigher-levelhumanconcepts describingthemodel'sbeliefsandreasoningprocesses. 24% Robustness Improvingtheworst-caseperformanceofAIsystemsevenonanomalousinputs,reducing thelikelihoodofunpredictableandunintendedbehaviorsinnovelsituations. 13% Safetyevaluations AssessingwhetheranAIsystempossessesdangerouscapabilities,toinformdecisions aboutmitigationsandwhetherthesystemissafeenoughtodeployorcontinuetotrain. 11% Power-seekingtendencies UnderstandingwhetherandhowAIsystemsdisplaypower-seekingtendencies,and investigatingmethodstoinhibitsuchtendencies. 4% HonestAI EnsuringthatAIsystemsaccuratelycommunicatetheirbeliefsandreasoning,makingit easiertodetectharmfulgoalsandplansinhighlycapablesystems. 4% Safetybydesign PioneeringnovelapproachestobuildingintrinsicallysafeAIsystems,suchaswithformal proofsthatsystemswillbehaveincertainways. 3% Unlearning Deliberatelymakingmodelslesscapableforsomedangeroustasks. 3% Modelorganismsofmisalignment CreatingsimpledemonstrationsofAIdeceptionorotherconcerningbehaviors,andtesting whetherproposedsafetytechniquesworkontheseexamples. 1% Multi-agentsafety UnderstandingandmitigatingrisksfrominteractionsbetweenAIsystems. 3 0% ControllinguntrustedAIs TechniquestomakeAImodelslessdangerous,evenifthosemodelsare“misaligned,”i.e., inclinedtoactinawaythatthedeveloperdoesnotintend. 0% 3 Althoughwecouldnotfindpapersinscopethatarespecificallyaboutsafemulti-agentsafety,therearepapers aboutmulti-agentinteractionsingeneral,particularlyfromDeepMind(forexampleAgapiouetal.,2023). 2 “Relevantpapers”referstoresearchpublishedbyAnthropic,(Google)DeepMind,andOpenAIbetweenJanuary 2022andJuly2024(inclusive)thatmetourinclusioncriteria. MappingTechnicalSafetyResearchatAICompanies |3 TableofContents ExecutiveSummary......................................................................................................2 TableofContents..........................................................................................................4 Introduction..................................................................................................................5 Method.........................................................................................................................7 DefiningapproachestosafeAIdevelopment...........................................................7 Quantitativeanalysisofpublishedresearch..............................................................7 Qualitativeanalysisofcompanies’incentives...........................................................8 ResultsandAnalysis...................................................................................................10 Enhancinghumanfeedback(31papers)................................................................10 Mechanisticinterpretability(19papers)..................................................................12 Robustness(10papers).........................................................................................13 Safetyevaluations(9papers).................................................................................15 Power-seekingtendencies(3papers)....................................................................17 HonestAI(3papers)..............................................................................................19 Safetybydesign(2papers)...................................................................................21 Unlearning(2papers).............................................................................................23 Modelorganismsofmisalignment(1paper)...........................................................24 Multi-agentsafety(0papers).................................................................................25 ControllinguntrustedAIs(0papers).......................................................................27 FutureProspects.........................................................................................................29 Acknowledgements....................................................................................................31 Bibliography................................................................................................................32 MappingTechnicalSafetyResearchatAICompanies |4 Introduction Therearesignificantconcernsfromgovernments,civilsociety,andAIcompanies 4 that advancedAIsystemsmayposelarge-scalerisksfrommisuseoraccidents(TheWhiteHouse, 2023;Bengioetal.,2024;CenterforAISafety,2023). Inthispaper,weexploreaparticularsubsetoftheworkbeingdonetoreducelarge-scaleAI risks.Specifically,weinvestigateAIcompanies’technicalresearchintosafeAIdevelopment, andwhatresearchapproachesaremorecommon. Weuse“safeAIdevelopment”tomeandevelopingAIsystemsthatareunlikelytopose large-scalemisuseoraccidentrisks.Thisisrelatedto,butbroaderthan,theconceptofAI ‘alignment,’whichgenerallyreferstomakingAIsystemsactasintendedbythedeveloper (Guest,Aird,andÓhÉigeartaigh,2023,p.33–34).SafeAIdevelopmentisjustonetopic wheretechnicalresearchmightbeneededtoreduceAIrisks.Othertypesofpotentiallyvaluable researchinclude: ●ReducingrisksposedbyAIsystemsthathavealreadybeendeployedsuchas bymaking“deploymentcorrections”(O’Brien,Ee,andWilliams,2023). ● TechnicalAIgovernance,i.e.,usingtechnicalanalysisandtoolstosupportAI governance.Forexample,usingtechnicalmeasurestomakeithardertostealhigh-risk AImodels(Reueletal.,2024). ● SystemicAIsafety,i.e.,reducingAIrisksbyfocusingonthecontextsinwhichAI systemsoperate(UKAISafetyInstitute,2024b).Inaninfluentialpaper,Hendrycksetal. (2022)focusparticularlyonusingmachinelearningtechniquestoincreasesystemicAI safety.Forexample,usingAIsystemstoincreasecybersecuritytodefendagainst AI-enabledcyberattacks. WealsocommentontheincentivesthatAIcompanieshavetododifferenttypesofresearch. Weusethisincentivesanalysistomakesometentativepredictionsaboutfuturechangesinthe sortsofsafetyresearchcompanieswilldo. UnderstandingwhatresearchAIcompaniesaredoing,andwilldoinfuture,isvaluablefor allocatingresearchfundingfromotheractors.Iffundersdonottakeintoaccountwhatresearch AIcompaniesarelikelytodo,thentheymightwasteresourcesbyduplicatingefforts. Additionally,theymightideallyfundresearchthatmakesdifferent‘bets’toAIcompanies’ research,makingadiversifiedportfolioofsafetyinvestmentsforgreaterdefenseindepth 4 By“AIcompanies”,wemeanthecompaniesthataredevelopingAIsystems.Weparticularlyfocusoncompanies thataretrainingverycompute-intensivemodels,i.e.,modelsthatrequirecomputationworthtensofmillionsof dollarstotrain.ExamplesofsuchcompaniesincludeAnthropic,GoogleDeepMind,andOpenAI.Manyexpertshave linkedthemostsevereAIrisksspecificallytocompute-intensivemodels(Bengioetal.,2024). MappingTechnicalSafetyResearchatAICompanies |5 (Barnes,2024,p.100–101).Thatsaid,ourpapercannotbeusedonitsowntoidentify promisingfundingopportunities;weleaveotherrelevantconsiderations(notedbelow)outof scope.Weplantomakerecommendationsforgovernmentandphilanthropicfundingthat considerthesefactorsinfollow-upwork. AdditionalconsiderationsforresearchintosafeAIdevelopment ThispaperexploresAIcompanies'likelyresearchfocus.Whendecidingwhatresearch wouldideallybesupportedbyotheractors,suchasgovernmentsorphilanthropists,several additionalconsiderationsarebeyondourscope: ●Howpromisingistheresearchapproach?Everythingelsebeingequal,funders andresearchersshouldfocusonapproachesthataremorelikelytobefruitful. 5 ● Mighttheresearchbringadditionalbenefits?Researchcouldbehelpfulboth forreducinglarge-scaleAIrisksandforachievingotherdesirablegoals. ● Mighttheresearchcauseharm?Researchersandfundersshouldbemore reluctanttosupportresearchthatdoesharmaswellasbringingbenefits. 6 ● Istherelevantinstitutioninagoodpositiontodoorfundthisresearch? Specificinstitutionswillhaveexpertiseorstructuresthatmakethemparticularly well-suitedtosupportparticularkindsofresearch. 7 Inthefollowingsectionwedescribeourmethod.Wethenpresentourresults,goingthrough thedifferentapproachestosafeAIdevelopment,sortedbynumberofpapers.Weconclude withthoughtsonwhatcategoriesmightseemoreresearchfromAIcompaniesinfuture. 7 Forexample,academiamightbeparticularlywell-suitedforresearchintomechanisticinterpretability.Improving interpretabilitytechniquesdoesnotrequireusingthelargestmodels(whicharesometimesoutofreachofacademic researchers),andisamenabletopublishingpapers(Kästner&Crook,2023;Zimmermann,Klein,andBrendel, 2024),consistentwithacademicincentives. 6 SeeGuest,Aird,andÓhÉigeartaigh(2023)foranoverviewofwaysinwhicheffor tstosupportalignmentresearch mightcounterintuitivelyincreaselarge-scaleAIrisks. 5 Predictingresearchsuccessinadvanceisdifficult.Asaresult,fundersshouldoftenbewillingtoinvestin 'moonshots'thatseemunlikelytowork,butwouldhaveanoutsizedimpactiftheydo.Additionally,investingin addressingrisksnotaddressedbyotherapproachesisespeciallyvaluable. MappingTechnicalSafetyResearchatAICompanies |6 Methods DefiningapproachestosafeAIdevelopment OurfirststepwasidentifyingdistinctcategoriesofresearchwithinsafeAIdevelopment.As describedabove,thismeansfocusingontechnicalresearchthatcouldbeusedtodevelopAI systemsthatareunlikelytoposelarge-scaleaccidentormisuserisks.Wecarriedouta literaturereviewofresearchagendasandtaxonomiesthatarerelevanttosafeAIdevelopment 8 andconsultedmultipleexpertsinthefield.Wethencollatedresearchapproachesinto11key categories. Quantitativeanalysisofpublishedresearch OurprimarymethodwastodeterminehowmanypapersbyleadingAIcompaniesaboutsafe AIdevelopmentareineachofthecategoriesthatweidentified.Thepapersthatwecollected, andourcategorizationsofthem,areavailableonaGoogleSheet.Codeanddetailsaboutour methodisonGitHub. Wehadthreeinclusioncriteriaforpapers: ●StronglyassociatedwithAnthropic,(Google)DeepMind,orOpenAI. 9 We operationalizedthisasresearchthatispublishedonthosecompanies’websitesor researchwherethefirstauthorlistsoneofthesecompaniesasanaffiliation.We focusedonthesethreecompaniesbecausetheyarearguablytheleadingdevelopersof advancedAI. 10 Asaresult,theyareparticularlywell-placedtodoresearchintosafeAI development,aswellasorganizationstowhomthisresearchisparticularlyrelevant. ● PublishedbetweenJanuary2022andJuly2024.Wechosethistimeperiodto balancecollectingalargenumberofpaperswithbeingup-to-date. ● RelevanttosafeAIdevelopment.Tomakethisjudgmentwereviewedthepaper’s titleandabstract,and,ifwewereuncertain,therestofthepaper. 10 Forinstance,theyhavecarriedoutthelargestpubliclyknowntrainingrunsandareamongthecompanies responsibleforthemostcitationsinAIresearch(Cottier,2023).Additionalpointsofevidenceincludethatthese companiescurrentlyseemtohavethebestchatbotproducts(Chiangetal.,2024)andthattheyarereferredtoas theprimaryartificialgeneralintelligencecompaniesinotherresearch,suchasSchuettetal.(2023,p.3).Other companiesthatwouldberelevantbysomeofthesecriteriaincludeMeta,Microsoft,andnon-DeepMindpartsof Google. 9 DeepMindhasbeenapartofGooglesince2014.InApril2023,itwasmergedwithGoogleBrain,anotherAIgroup withinGoogle,andrenamedGoogleDeepMind(Murgia,2023).Weuse“(Google)DeepMind”torefertoDeepMind (butnotGoogleBrain)uptoApril2023andGoogleDeepMindfromApril2023. 8 Theagendasandtaxonomieswereframedinslightlyvaryingways,e.g.,sometimesintermsof“alignment”or“ML safety”.ThemainsourcesthatweconsultedwereHendrycksetal.,2022;OpenPhilanthropy,2022;Critch& Krueger,2020;Jietal.,2024;Anwaretal.,2024;Toner&Acharya,2022;andAmodeietal.,2016. MappingTechnicalSafetyResearchatAICompanies |7 Welookedforpapersintwoplaces: ●Companywebsites:Wemanuallyreviewedpostsonthewebsitesofthethree companiestoseewhethertheylinkedtoresearchthatmeetsourinclusioncriteria. ● ArXiv:ArXivisapreprintserverwheremachinelearningresearchisoftenpublished.We programmaticallycollectedpapersonarXivthatwerefromtherelevanttimeframe,had specifictermsinthetitlerelatingtosafeAIdevelopment,andwherethefirstauthor listedanaffiliationatoneofthethreecompanies.Wemanuallyreviewedtheresulting papersforinclusioninourdataset. 11 Ourfinaldatasethad80papers.Wemanuallyreviewedthetitleandabstractofeachofthese paperstodeterminetherightcategory. 12 Somepapersarerelevanttoseveraloftheresearch categories,inwhichcasewepickedthesinglebestfit. Alimitationofourmethodisthatitdoesnotincludeunpublishedresearch,orresearchthatAI companiessharepubliclybutnotasapaper.Thisispotentiallyasignificantlimitation,asthere areseveralreasonswhyAIcompaniesmightnotpublishtheirresearchintosafeAI development. 13 Forexample,researchintosafeAIdevelopmentcansometimesbeusedto makeAIsystemsbothsaferandmorecapable(Guest,Aird,andÓhÉigeartaigh,2023,p. 15–20;Brady,2024;Khlaaf,2023).Makingresultsfromsuchresearchaccessibleto competitorsmightputcompaniesatacommercialdisadvantage(Albergotti,2024).Making resultsaccessibletoforeigngovernmentsmighthavenationalsecurityimplications.Additionally, companiesmighthavemundanereasonsnottopublishtheresearch,suchasnotwantingto gothroughtheprocessofwritingapaper. Qualitativeanalysisofcompanies’incentives WealsodidaqualitativeanalysisofthecorporateincentivesthatAIcompanieshavetodo researchinthevariouscategories.Thishelpstoadjustforthefactthatsomeresearchwillnot bepublished.ItmightalsobehelpfulformakingpredictionsabouthowresearchbyAI companieswillchangeovertime.WefocusonthreekindsofincentivesthatAIcompanieshave todoresearchintosafeAIdevelopment. 14 14 Itmayalsobethatthevaluesoftheemployeesorleadershipofcompaniescausesthemtodomoresafety researchthancorporateincentiveswouldsuggest.Indeed,thethreeleadingAIcompanieswereeachfoundedwith reducinglarge-scaleAIrisksasimportantmotivations(Perrigo,2023a;Matthews,2023;Metz,2016).However,itis 13 Thismayvarybyresearchcategoryaswell.Forinstancemechanisticinterpretabilityresearchisrelatively unproblematictopublish,whereassometypesofadversarialrobustnessresearcharesafertonotpublish,lest adversarieslearnhowtodefeatsafetystrategiesemployed. 12 AsexplainedinmoredetailonGitHub,wealsousedalanguagemodeltocheckourcategorizations.Insome cases,weupdatedourcategorizationbasedonthisstep. 11 Becauseofissueswithourcode,anearlierversionofourresearchincorrectlyexcludedsomerelevantpapersfrom arXiv.Weapologizefortheerror.Ourcorrectedpaperwasfinaliz edonSeptember25 th 2024andtheoriginalversion canstillbeaccessedforarchivalpurposesonarXiv.TheissuesaredescribedinmoredetailonourGitHub repository. MappingTechnicalSafetyResearchatAICompanies |8 IncentivesforAIcompaniestodosafeAIdevelopmentresearch Reputation:CompanieswithbetterreputationsforsafeAIdevelopmentwillprobablyfindit easiertoattractthebesttalentandthemostcustomers.Recentsurveyshaveshown majoritiesofrespondentsareconcernedaboutrisksfromAIandsupportsafetyefforts (DepartmentforScience,Innovation&Technology,2023b;Pauketat,Bullock,andAnthis, 2023).SafeAIdevelopmentmightalsoreducethedownsiderisksofacompany’s reputationbeingharmedifitsmodelisinvolvedinalarge-scaleincident,eitherthrough misuseoraccident.DifferentresearchcategoriesinsafeAIdevelopmentmightcontribute differentlytothereputationsofAIcompanies.Forexample,somecategoriesmightbeof moreinteresttothepublic. Regulation:AIcompaniesmightdoresearchintosafeAIdevelopmentsothattheycan complywithregulations.Thesemightincluderegulationsthatdirectlyregulatesafetyand securityrisks,orregulationsaboutothertopicsthattouchonsafetyandsecurityrisks.For example,manyjurisdictionsareconsideringrulesthatwouldrequireAIdecision-makingto be“explainable”inimportantcontexts(InformationCommissioner’sOfficeandTheAlan TuringInstitute,2022,p.10–15).Thetechnicaldevelopmentsrequiredtoimplementthis mightalsobehelpfulforpreventingAIaccidents,suchasbyhelpingusbetterunderstand howAIsystemsmakedecisions. Usefulness:SomeresearchintosafeAIdevelopmentmighthaveaco-benefitforAI companiesviaalsoimprovingtheusefulnessofthecompany’sAIsystems. 15 Forexample, reinforcementlearningfromhumanfeedbackwasinitiallydevelopedasasafetytechnique, butalsomakesAIsystemsmorelikelytorespondhelpfullytouserqueries(Casperetal., 2023). 15 Indeed,thereisnoclearboundarybetweensafetyandcapabilitiesresearch;someworkcouldfallinagrayareaof beingboth. unclearhowvaluesaroundreducinglarge-scaleAIriskswouldtranslateintodecisionsaboutwhichresearch categoriestoprioritize.Moreover,companiesmaybecomeincreasinglyprofit-drivenasthefinancialstakesofAI developmentbecomelarger.Therefore,wedidnotconsidercompanies’valuesinouranalysis. MappingTechnicalSafetyResearchatAICompanies |9 ResultsandAnalysis Figure1summarizesourresultsforthe80papersthatmetourinclusioncriteria.Theremainder ofthissectionpresentsourdetailedfindingsforeachresearchcategory. Figure1:DistributionofsafeAIdevelopmentresearchpapersbycategoryandbyselectcompanies,from January2022toJuly2024. Enhancinghumanfeedback(31papers) Researchonenhancinghumanfeedbackaimstoimprovetheuseofhumanpreferencedatain trainingAIsystemstomakedecisionshumanswouldapproveof,iftheyknewalltherelevant details.AkeytechniqueinthiscategoryisReinforcementLearningfromHumanFeedback (RLHF),whichuseshumanpreferencestofine-tunemodels(Christianoetal.,2017).RLHF updatesmodelweightstoincreasethelikelihoodofproducingoutputsthathumanraterswould judgeashelpful. WhileRLHFhasproveneffectiveforcurrentsystems,itfacesasignificantchallengeknownas the“scalableoversight”problem(Bowmanetal.,2022;Casperetal.,2023).AsAIcapabilities grow,itbecomesincreasinglydifficultforhumanstoeffectivelyevaluateAIoutputstheymaynot fullyunderstand.Additionally,gatheringsufficienthumanfeedbackmightbecomeprohibitively MappingTechnicalSafetyResearchatAICompanies |10 expensiveandtime-consumingatthescalerequiredfortrainingadvancedAIsystems. Toaddressthesechallenges,researchersareexploringseveralmoreadvancedtechniques: 1.ReinforcementLearningfromAIFeedback(RLAIF):ThisapproachusesAIsystems toevaluateoutputsagainstpredefinedrulesorprinciples,reducingdirectrelianceon humanjudgment.Forexample,researchershaveruntrialexperimentsusingone languagemodeltoevaluateanother(Perezetal.,2022),includingwheretheevaluator languagemodelisfarlesscapablethantheevaluateemodel,tosimulateafuture scenariowherehumansandlesscapableAIsareevaluatingsuperhumanAImodels (Burnsetal.,2023).ConstitutionalAIisanotherexampleofthismethod,whereAI modelsthemselvesevaluateoutputsagainstthestandardsofafixedhuman-given 'constitution'outliningdesiredbehavior(Baietal.,2022). 2.Debate:ThisapproachgetstwoAIinstancestomakeargumentsandcritiqueeach other'sreasoning,withahumanjudgemakingthefinaldecision(Irving,Christiano,and Amodei,2018).ByhavingAIssurfacekeyconsiderations,thisapproachaimstomake humanevaluationmoreefficientandeffective,evenforcomplextopics. 3.IteratedDistillationandAmplification(IDA):Ahumanassistedbymanycopiesofa trusted,moderatelycapableAIsystemwillusuallyperformbetterthananunaided humanatsometask.Anew,morepowerfulAIsystemcouldthenbetrainedtoimitate thebehavioroftheearlierhuman+AIsteam.IteratingthisprocesscouldresultinAI systemsthatarebothverycapableandaligned(Cotra,2018). Corporateincentives •Reputation:RLHFandRLAIFarereasonablyeffectiveatensuringthat chatbotsdonotrespondtousersoffensivelyorinsensitively(Ouyangetal., 2022;Baietal.,2022),makingreputation-harmingincidentslesslikely.Debate andIDAaretooearlyintheirdevelopmenttobehelpfulforcurrentlydeployed AIsystems,meaningthattheycurrentlyhavelessreputationalbenefit. •Regulation:Similartoabove,RLHFandRLAIFmaybehelpfulformaking chatbotslesslikelytoproduceoutputsthatmaybeprohibitedbyregulations, suchashatespeechoradviceonhowtocommitcrimes(Baietal.,2022). •Usefulness:ThelargestbenefitfromtechniquessuchasRLHFisthatAI systemsfine-tunedwiththesemethodsaremuchmorehelpfultointeractwith (Christianoetal.,2017;Ouyangetal.,2022).Thisisbecauseratherthan,for instance,naivelypredictingthenexttoken,theywillinterpretthehuman’sinput inamore‘sensible’wayandrespondtotheexplicitorimpliedrequest.Some kindsofworktoenhancehumanfeedback,suchasworkthatbuildsonRLHF, MappingTechnicalSafetyResearchatAICompanies |11 canalsoyieldbeneficialoutcomesquickly.Techniquesthatstillrequireyearsof researchwouldbelessattractivetocorporatedecision-makers. Althoughweexpectenhancinghumanfeedbackoveralltobenefitfromsignificantresearch effortsbyAIcompanies,thereisalotofvariationwithinthiscategory.Enhancinghuman feedbackincludestechniquesthatrangefrombeingreadilyimplementableandreasonably well-understood(suchasRLHF)totechniquesthataremuchmorespeculativeandmostlynot yetreadytobeused(suchasdebate).Weexpectthatcompanieswillfocusontheformerpart ofthisspectrumbecausepay-offswillbequickerandmorelikely. Mechanisticinterpretability(19papers) AdvancedAImodelsarelikeblackboxestous;eventhoughwecanexaminetheinner workingsandnumericalvaluesthatdefinethemodel,thisdoesn’tprovide human-understandableinsightsintohowtheAIreasonsorwhatitbelieves.Mechanistic interpretabilityattemptstorenderAIsintelligiblebydevelopingtoolsforconvertingmodel weightsintousefulhigher-levelhumanconceptsdescribingmodelcapabilities,behaviors,and beliefs(Bereska&Gavves,2024;Räukeretal.,2023).Powerfulinterpretabilitytoolswouldallow ustoknowbetterwhenmodelshavedangerouscapabilities,whatthegoalsofmodelsare,and whethertheyarebeingdeceptive(Olah,2023).Thiswouldbetterinformdecisionsabout whetheritissafetodeployagivenmodel,orwhatsafetyfeaturesneedtobeadded.Indeed, regulationscouldbeimplementedmandatingthatAIsystemscannotbedeployedwithout guarantees,suchasfrominterpretability,thattheydonotbehaveindeceptiveways(Clymeret al.,2024,p.32-35).Interpretabilitycouldalsoeventuallybeusefultotesttheeffectivenessof varioustechnicalalignmentapproaches,byexaminingtheinternalsofmodelstrainedwith safetytechniquesandnotingwhetheranyconcerningfeatures,suchaspower-seeking propensities,arepresent(Jietal.,2024). Templetonetal.(2024)provideanillustrativeexampleofhowmechanisticinterpretability researchmightcontributetoreducingAIrisks.Theauthorsextractedhigh-level“features”from amedium-sizedlanguagemodel,identifyingwherevariousconceptsarestoredinthemodel. Theseincludeconceptsthatarerelevanttosafety,suchas‘unsafecode’,‘bias’,‘deception’, and‘power-seeking’.Templetonetal.demonstratethatmodelbehaviorcanbechangedby artificiallyup-ordown-regulatingcertainfeatures.However,theyacknowledgethattheircurrent resultsonlydemonstratetheplausibleusefulnessofthisapproach,andthatsignificant limitationsandopenquestionsremainbeforeinterpretabilitytoolscoulddefinitivelyimprove safety. MappingTechnicalSafetyResearchatAICompanies |12 Corporateincentives •Reputation:Progressonmechanisticinterpretabilitywilllikelybelessvisibleto mostcustomers,atleastinthenearfuture,soreputationaleffectswillprobably beweakhere.Conversely,impressiveinterpretabilityworkwillbelegibleto technicalAItalent,helpingwithhiring.Alternatively,itmaybethatthefocuson interpretabilityismorepeculiartothevaluesandhistoryofAnthropicasa company—theyproduced12outofthe17papersinthiscategory—rather thangeneralcorporateincentives. •Regulation:MultiplejurisdictionsareconsideringrulesthatwouldrequireAI decision-makingincertaincontextstobeexplainable,whichwouldlikely requiresignificantadvancesininterpretability(Nannini,Balayn,andSmith, 2023).Governmentsmayalsowanttoimposespecificrulesoninternal featuresofAIdecision-makingsystems,forinstance,thatdriverlesscars shouldnotconsiderdemographicdetailsofaffectedpedestrianswhen modelingcrashcontingencies.Withoutadvancesininterpretability,itmightbe difficultforcompaniestodemonstratethattheirsystemscomplywiththis regulation.Thus,AIcompaniesmightbeworriedthatworkoninterpretability wouldacceleratetheintroductionofburdensomeregulationthatdepends uponinterpretability.Alternatively,ifsuchregulationsareforthcominganyway, anAIcompanythatdevelopsaleadinunderstandingitsmodelsmaybebetter placedtocompeteinatighterregulatoryenvironment,incentivizingincreased investmentintointerpretability. •Usefulness:Bettermechanisticinterpretabilitywouldgivecompaniesdeeper insightsintotheirmodels,andtherebyacceleratethescienceofdeeplearning, makingiteasiertodesignmoreusefulAIsystems(seeegPolietal.,2023). Theeconomicbenefitsofbetterinterpretabilityareindirect—wedonotknow whichcapabilitieswillbeunlockedorresearchdirectionspioneered—but couldbesignificant,asitishardertomakeimprovementstopoorly understoodsystems. Robustness(10papers) AIsystemsoftenperformconsiderablyworseoninputsthatdiffergreatlyfromtheirtraining data.Forexample,facialrecognitionsystemsarelessaccurateforgroupsthatare underrepresentedinthetrainingdata,suchaspeopleofcolor(Buolamwini&Gebru,2018). WorkonrobustnessaimstoensurethatAIsystemsmaintainminimumperformancestandards evenoninputsthatareunprecedentedforthesystem.Increasingrobustnesscouldreduce large-scaleAIrisksinseveralways(Hendrycksetal.,2022;Gleave,2023),including: MappingTechnicalSafetyResearchatAICompanies |13 ● ReducingthelikelihoodofAIsystemsbehavinginunpredictableandunintendedways whentheyareinnewsituations. ● MakingAIsystemslessvulnerabletoadversarialattacksfromhumans,i.e.,attemptsto getAIsystemstoactinwaysnotintendedbytheirdevelopersbycleverlytailoring inputs(Zouetal.,2023b).Thiscouldreducemisuserisks,suchasbymakingitharder foruserstoelicitinformationaboutdesigningbioweapons. ● ReducingvulnerabilitytoadversarialattacksfromotherAIsystems. 16 Oneproposaltoincreaserobustnessistouseadversarialtraining—generatingtrainingdatathat deliberatelyseekstocausemodelfailures,andupdatingthemodeltoadapttotheseinputs—to ensurethatmodelscanwithstandadversarialattackspost-deployment(Ziegleretal.,2022). Otherworkseekstoensurethatthepatternsamodellearnstorelyoninatrainingdatasetdo notmis-generalizewhenexposedtoanovelenvironmentwherethosepatternsnolongerhold (Armstrongetal.,2023).Forinstance,ifadriverlesscaristrainedondatawhereastopsignis alwaysofacertaincolororshape,butthenitencountersanovelstopsignontheroad,its learnedproxyof‘stopwhenyouencounteraredoctagon’mayfail,causinganaccident. Asanadditionalapproach,developerscouldaimtoensurethatsystemsthatuseAIhavea “fail-safe”processthattheyusewhentheAIisinanunusualsituationandsoathigherriskof failing.Forexample,chatbotproductscouldreplacethemodel’stextoutputwithasimple messagedecliningtocomment,orself-drivingcarscouldbringthecartoacontrolledstop. Thatsaid,identifyingcaseswhereamodelisanunusualsituationremainsanunsolvedproblem (Hendrycksetal.,2022,5;Rudner&Toner2024). Corporateincentives •Reputation:ImprovedrobustnesswillmakeAIsystemsmoreattractiveto deployascompaniesdonotwanttheirsystemstorespondunpredictablyto anomaloussituations.Forinstance,Microsoft’sreputationwasharmedduring theinitiallaunchofBingChat,whereanalter-egocalled‘Sydney’emerged andthreatenedusers(Perrigo,2023b).Thiswouldbeavoidablebyamore robustAIsystem. •Regulation:SomedomainsrequireveryhighreliabilityforanAIsystemtobe useful,suchassurgicalroboticsystems,driverlesscars,andmilitaryAI 16 ThisisimportantbecausemanyproposalsforensuringthatAIsystemsbehaveasintendedinvolveAIsystems supervisingandcheckingtheworkofotherAIsystems.ThisapproachwouldbeunreliableifthesupervisedAI systemscanmanipulatethesupervisingAIsystems. MappingTechnicalSafetyResearchatAICompanies |14 systems. 17 Inthesecases,regulationmayplaceastringentburdenofproofon AIcompaniestodemonstratetherobustnessoftheirsystemsacrossall plausiblescenarios. •Usefulness:TheprimarywayinwhichrobustnessmightmakeAIsystems moreusefulisbyallowingthemtobeusedinhigher-stakescontexts.Insome cases,robustnessimprovementswillreducepeakperformance,asthesystem becomesmoreconservativetohandleawiderrangeofinputs.Butthiswill oftenbeavaluabletrade-off;robustnessworkwillbekeytofurther commercializingAIsystems. Robustnessresearchmaybeespeciallylikelytobeunpublished.Robustnessresearchisoften aimedatmakingmodelsmorerobusttoattacksbyadversaries;sharingsuchresearchmight helpthoseadversaries.Asaresult,robustnessmaybeevenlessneglectedthanisindicatedby thenumberofpapers. Safetyevaluations(9papers) SafetyevaluationsareempiricalassessmentsofwhetheranAIsystemhasdangerous properties.Shevlaneetal.(2023)describetwokindsofsafetyevaluationsforlarge-scaleAI risks. 18 DangerouscapabilitiesevaluationsassesswhetheranAIsystemhasoffensive capabilities,suchastheabilitytodesignbiologicalorcyberweapons(Lietal.,2024)and/or capabilitiesthatwouldallowamisalignedAIsystemtobetterevadehumanoversight,suchas theabilitytocopyitsmodelweightsontoanewserver(Kinnimentetal.,2024). 19 Alignment evaluationsassessthepropensityofAIsystemstoapplytheircapabilitiesforharm,suchas becausetheyarepursuinggoalsthattheirusersdidnotintend. Safetyevaluationsarecommonlyusedandproposedasawaytoreducelarge-scaleAIrisks.A keyexampleis“responsiblecapabilityscaling”(DepartmentforScience,Innovation& Technology,2023a).Anthropic,OpenAI,andGoogleDeepMindhaveallpublishedpoliciesand protocolsforhowtheywillmonitordangersfromtheirAIsystems,andrespondwith appropriatemitigationsastheyscaletomorecapablemodels(Anthropic,2023;Dragan,King, andDafoe,2024;OpenAI,2023).Governmentregulationincreasinglyalsoinvolvesmodel safetyevaluations.Forexample,theEUAIActimposesarequirementfor“general-purposeAI 19 Many“dangerouscapabilities”aredual-use;forexamplebeingbetterabletoplanacomplexsequenceoftasks wouldhelpanAIsystemtoevadehumanoversightbutalsotoactinvariousbeneficialways. 18 Thereisalsoalargebodyofworkonevaluationsforvariousotherharmfulbehaviors,suchaswhethertheir outputscontainprivacyviolations,stereotypes,orhatespeech(Birhaneetal.,2024,p.7). 17 Indeed,wemayseeevenmorefocusonrobustnessfromcompaniesastheycreateAIproductsthatareusedin criticalinfrastructuresystemsorotherhigh-reliabilitysettings(DepartmentofHomelandSecurity,2024). MappingTechnicalSafetyResearchatAICompanies |15 systems”toundergomodelsafetyevaluations—thoughthedetailofwhatthismeansstillneeds tobedefinedbystandard-settingorganizations(EuropeanParliament,2024;Heikkilä,2024; Pouget,2023).TheBidenAIExecutiveOrderalsorequiresAIcompaniestosharewiththeUS governmenttheresultsfromsafetyevaluationsofthemostadvancedAIsystems(TheWhite House,2023).SafetyevaluationsarealsoakeyfocusofgovernmentresearchintosafeAI development.Inparticular,theUKAISafetyInstituteisdevelopingsafetyevaluations,and runningthemonmodelsfromleadingAIcompanies,withtheUSAISIsimilarlywantingto improvethesciencearoundsafetyevaluations(UKAISafetyInstitute,2024a;NIST,2024,p.4). Theevaluationsthatneedtobecarriedoutwillvaryaccordingtotheaffordanceswithwhichan AImodelwillbedeployed.Forexample,ifamodelwillbedeployedwithaccesstotheinternet, thenitismorecapableinvariousways(includingforcausingharm)sowillrequiredifferent,and likelymorestringent,evaluationprocesses(Sharkeyetal.,2023).Evaluationsalsooccurat differentstagesinthemodeldevelopmentanddeploymentlifecycle.Forfutureadvanced modelsitcouldbeimportanttorunevaluationsatcheckpointsduringthetrainingprocessto testforsignsofdangerouscapabilities.Testscanberunaftertrainingbutbeforedeployment, and(outofscopeinthisreport)ongoingmonitoringandevaluationcanalsooccuroncea modelisdeployed. 20 Corporateincentives •Reputation:AIcompaniesmightbeincentivizedtoimprovetheirreputationby developingevaluationsthattheycanusetocrediblyclaimthattheirsystems aresafe. 21 •Regulation:Asdescribedabove,jurisdictionsareincreasinglyrequiringAI companiestodemonstratethattheirAIsystemshavepassedsafety evaluations.ThiscreatesanincentiveforAIcompaniestodoR&Dtodevelop evaluationsthatwouldsatisfyregulators.AIcompaniesmightprefertodothis R&Dthemselves,ratherthanleaveittothirdparties,sothatAIcompaniescan shapetheevaluationsinawaythatbenefitsthem. 22 22 Thiscouldhappeninbenignorbeneficialways,suchasifAIcompaniesdesignevaluationsthatfitwellintotheir existingworkflowsortakeintoaccounttechnicaldetailsnotvisibletothoseoutsideAIcompanies.Thiscouldalso happeninwaysthatareworseforthepublicinterest,suchasifAIcompaniesdesignevaluationsthatappear rigorousbutareinfactrelativelyeasytopass,meaningthattheydonotprovidemuchassuranceaboutsafety. 21 Thatsaid,ifanAIsystemfailsasafetyevaluation,thenthismightharmthereputationoftheAIcompany. 20 Evenonceamodelhasbeendeployed,continuingtodoevaluationscanbevaluable,astheworldcontextin whichthesystemisembeddedmaychange(e.g.,anewbiodesignsoftwareandlabautomationsystemcomes online),ornoveltechniquesarepioneeredtomisusesthedeployedsystem(O’Brien,Ee,andWilliams,2023; Davidsonetal.,2023). MappingTechnicalSafetyResearchatAICompanies |16 •Usefulness:ModelsafetyevaluationsofteninvolvemakingAIsystemsmore capableatthespecifictaskstheyaretestedon,suchaswithfine-tuningor scaffoldingtosupportinternetnavigation(Shevlaneetal.,2023,p.13). InsightsfromthisprocessmightbeinformativeforhowtomakeAIsystemsin generalmorecapable. 23 Astheboxabovehighlights,AIcompanieshaveincentivestoresearchsafetyevaluationsso thattheycandemonstratethesafetyoftheirsystems.Evenso,itwouldbevaluableforother organizationstoresearchsafetyevaluations.AIcompaniesmighthaveincentivestomaketheir evaluationsinsufficientlyrigorous,sothatitislesslikelythatoneofthecompany’ssystemswill failanevaluation(Gruetzemacher,2024).ApartialsolutiontothisisforAIcompaniestopay thirdpartiestodevelopevaluations,similartohowcompaniesingeneralhavetheirfinances auditedbyexternalaccountants.Thatsaid,thesethirdpartiesmaystillhaveincentivesto developevaluationsthatareinsufficientlyrigorous,suchasifAIcompanieswouldbemore willingtoworkwiththirdpartiesthatdevelopeasierevaluations. 24 Asaresult,itmightbe valuableforotheractors,suchasgovernmentsandphilanthropists,tosupportworkonsafety evaluations,regardlessofhowmuchAIcompaniesaredoingorfundingsuchwork.Thiswould provideacheckagainstAIcompaniespotentiallydesigningevaluationsthataretooeasyto pass. Power-seekingtendencies(3papers) DevelopersareincreasinglyworkingtocreateautonomousAIagentscapableofcompleting complextaskswithouthumanoversight.ThistrendtowardsgreaterAIautonomyislikelyto continue,asmoreautonomoussystemsmayofferenhancedeconomicvalueandbroaderutility comparedtonarrowAItools(Chanetal.,2023;2024). However,thedevelopmentofhighlyautonomousAIraisesconcernsaboutpotential “power-seekingtendencies.”AdvancedAIsystemsmightseektoaccumulatepowerand resources,suchasfinancialassets,asaninstrumentalstrategyforachievingawiderangeof goals,evenintheabsenceofexplicitdirectivestodoso.Someresearcherswarnthat 24 Indeed,intheaccountancyexample,theredoseemtobecasesofauditorsbeinginsufficientlyrigorousforthis reason(TheEconomist,2014). 23 Forinstance,METR'sworkonevaluatingAIsystems'abilitytoautonomouslyreplicatemayadvancethefrontierof AI'scapacitytoactindependentlyintheworld(Kinnimentetal.,2024).Thesekindsofevaluationmethodsmightbe helpfulforanyoneattemptingtodesignbetterAIagents,regardlessofwhattheywouldbeusedfororhowfocused thedeveloperisonsafety,sinceitiseasiertomakeprogressinmachinelearningwhenthereisaclearbenchmarkto aimfor(Hendrycks&Mazeika,2022,p.6). MappingTechnicalSafetyResearchatAICompanies |17 sufficientlyadvancedAIcouldevenattempttodisempowerhumanityinordertoprevent interferencewithitsobjectives(Bostrom,2014;Carlsmith,2022;Russell,2019,p.140–42). Currently,therearefewreal-worldexamplesofAIsystemsexhibitingpower-seekingbehavior, althoughsomecontrolledexperimentshaveshownthispotential.Thelackofexamplesmaybe becausecurrentAIisnotsufficientlyadvancedtopursuecomplex,long-termstrategiesfor achievinggoals,likeaccumulatingpoweroverextendedperiods.AsAIcapabilitiesprogress, power-seekingtendenciescouldbecomemorelikelytodevelop(Hadshar,2023). It'sworthnotingthatmuchresearchaboutsafeAIdevelopmentisinsomesenserelevanttoAI power-seeking.Forexample,anyworkthatincreasesthealignmentofanAIsystemmight reducepower-seekingbyensuringAIsystemsbetteradheretohumanintentions.However,this sectionfocusesspecificallyonresearchthatdirectlytargetspower-seekingtendencies, particularlyeffortstomeasurethesetendenciesandtointerveneonAIsystemstoreducethem. Examplesofworkthatfocusonpower-seekingtendenciesinclude: ●Bettercharacterizingwhypower-seekingmightemerge.Forexample,Ngo, Chan,andMindermann(2024)reviewwhetherearlierphilosophicalargumentsabout whetherAIsystemswillpursuepower-seekingstrategiesapplytothecurrently dominantAIparadigm,deeplearning.KrakovnaandKramar(2023)showthatselection processesonAIsystemsduringtrainingarecompatiblewithpower-seekingtendencies persisting. ● Measuringpower-seekinginAIsystems.Forexample,Panetal.(2023)presentthe MACHIAVELLIbenchmark,whichmeasurestheextenttowhichAIagentsactin dangerouswaysinsimulatedenvironments,includingbyattemptingtoseekpower.In anoverlapwiththe‘Modelorganisms’approach,power-seekinghasbeenobservedin simplersystems,suchasanAImodelattemptingtoautonomouslyrewriteitsown rewardfunction(Denisonetal.,2024). ● Developingtechniquesthatreducepower-seeking.Oneapproach,“Cooperative InverseReinforcementLearning”(CIRL),istodesignAIagentsthataretryingtodoas thehumanwantsbutareuncertainaboutwhatthehumandoeswant(Hadfield-Menell etal.,2024).InCIRL,theAIsystemisincentivizedtolearnabouthumanpreferences throughactiveteachingandcommunicationratherthantakepowerfromthehuman. ProponentsarguethatthisuncertaintyaboutrewardsmayinfluencethelikelihoodofAI systemsengaginginmisalignedpower-seekingbehavior. Corporateincentives •Reputation:Mostofthedangerfrompower-seekingAIsystemscomesinthe futurewithmoreadvancedsystems(Ngo,Chan,andMindermann,2024,p. MappingTechnicalSafetyResearchatAICompanies |18 8–11).Asaresult,workingonthiscategorynowmighthaverelativelysmaller reputationalbenefitsthansafetyworktargetedtowardsrisksposedbyexisting models,particularlyifconcernaboutrisksfrompower-seekingsystems continuesnottobewidespread. •Regulation:Power-seekingincludesawiderangeofactivities,manyofwhich arelegalandacceptable,sothisisnotdirectlyamenabletoregulation.Legal liabilityforseriousaccidentscouldincentivisecompaniestorefrainfromhaving theirsystemsseekpowerindangerousways. •Usefulness:Byforeclosingcertainroutestoachievingobjectives(thosethat involveacquiringlargeamountsofpower,orconfidentlypursuingparticular goals)someversionsofthisinterventioncouldsignificantlylimitthecapabilities andeconomicvalueofAIsystems(Ngo,Chan,andMindermann,2024,p. 8–9).MuchoftheanticipatedfutureimpactofAImightdependonsystems beingabletooperateinanagentic,goal-directedmanner(Chanetal.,2023, p.8–10).Forexample,anAIsystemtaskedwithmanagingasupplychain needstobeabletomakecomplexplansandpersistentlyoptimizefor particularobjectives.Thesearetheverysametendenciesthatarelinkedto concerningpower-seekingbehaviors.ThisreducestheincentivesforAI companiestoprioritizethiskindofresearch. HonestAI(3papers) TherearealreadyexamplesofAIsystemsbehavingindeceptivewaystowardstheirusers. Chatbotsoftenseemtotelluserswhattheseuserswantorexpecttohear,evenifthatisnot true.Forexample,theyoftenclaimtosharetheopinionsthatusershaveexpressed,andcan bemorelikelytorepeatcommonmisconceptions(“ifyoucrackyourknuckles,you’lldevelop arthritis”)whentheuserappearstobelesseducated(Lin,Hilton,andEvans,2022).Thereare variousexamplesofAIsystemsbehavingindeceptivewayswhenthishelpsthemwingames. Forexample,AIsystemsplayingDiplomacyandStarcraftIIcantrickhumanplayers,suchasby “feinting”withwheretheyappeartobemovingtheirtroops(Parketal.,2024;Piper,2019).In anotherexample,GPT-4wasplacedinasimulatedenvironmentwhereitwastolditwasa stocktraderandshouldmakemoneywhilefollowingthelaw(Scheurer,Balesni,and Hobbhahn,2023).Thesystemmadeaprofitabletradebasedonapieceofinsiderinformationit wasgiven.Whenaskedbyits‘manager’toexplaintherationaleforthetrade,GPT-4reasoned thatitshouldnotrevealthatitengagedininsidertrading,andinsteadconcoctedafake explanationtodeceivethemanager. Inmanypracticalapplications,havinghonestAIsystemsisessential.Forinstance,toethically deployAIsystemsinhealthcaresettingstheremustbestrongguaranteesthatanAIsystemwill MappingTechnicalSafetyResearchatAICompanies |19 giveanhonestdiagnosisortreatmentrecommendation,notjustsaywhataclinicianorpatient ‘wantstohear’.Similarly,ifAIsystemsinvolvedincriticalinfrastructuregivetheiroperatorsa falseimpressionofwhatishappeningwiththisinfrastructure,operatorswouldnotknowthat thereisaproblemoccurringwheretheyneedtointervene. VarioustechniquesmightmakeAIsystemsmorelikelytobehonest: ●RepresentationEngineering(RepE)involvesanalyzingandmodifyingtheinternal “representations”AIsystemslearnduringtraining. 25 RepEaimstoreinforcehonest representationswhilesuppressingdishonestonesandhaspromisinginitialresults(Zou etal.,2023a). ●Chain-of-thoughtoutputshaveAIsystemsshowtheirreasoningprocessinaddition tothefinalanswer.Thismakesiteasiertocheckifthesystemisbeinghonestbyseeing ifthefinaloutputfollowslogicallyfromthereasoning.However,ensuringtheoutputted reasoningfaithfullyreflectstheAI'sactualreasoningprocessremainsanopenchallenge (Lanhametal.,2023).OnerecentapproachistogetanAImodeltodecomposelarger tasksorquestionsintosmallerones,anddelegatethesesmallertaskstoothermodel instances,makingitharderforonemodelinstancetotellaunifiedbutfalsestoryabout itsreasoning(Radhakrishnanetal.,2023). ●“Liedetection”aimstoidentifydeceptioninAIoutputs.Recentworkhasshown classifierscanbetrainedtodistinguishtruthsfromliesinlanguagemodeloutputsby analyzingpatternsinhowthemodelsrespondtofollow-up“elicitationquestions”. Pacchiardietal.(2023)demonstratedthatliedetectorstrainedonasinglemodelcan generalizewelltodetectingliesfromothermodelsinnewcontexts. ●Elicitinglatentknowledge(ELK)isbasedontheideathatAIsystemslearntrue informationduringtrainingthatisn'talwaysreflectedintheiroutputs.Forexample,anAI mightknowsomethingisfalsebutstilloutputitifthat'swhatitthinkshumanswantto hear.ELKtechniquesaimtoaccessanAI'slatentknowledgemoredirectlytoobtain moretruthfulinformation(Burnsetal.,2022;Farquharetal.,2023). ●MechanisticinterpretabilitycouldalsobeusefulforassessingwhetherAIsystems arebeinghonestbydevelopingtoolstounderstandtheinternalworkingsofAImodels, researcherscouldbetterassesswhetherAIsystemsarebeinghonest.However, mechanisticinterpretabilityisoftendiscussedasastandalonecategory,sowedevotea separatesectiontoit. Corporateincentives 25 RepresentationsarepatternsofneuralactivationsthatencodeinformationwithintheAImodel,similartoconcepts orideas. MappingTechnicalSafetyResearchatAICompanies |20 •Reputation:DevelopinghonestAIsystemscouldenhanceacompany's reputationfortrustworthinessandreliability.Thisisparticularlyrelevantfor applicationswhereaccuracyiscritical,suchashealthcarediagnosticsor financialanalysis. •Regulation:Inhigh-riskfutureapplicationsofAIsystems,suchasanalyzing classifiedintelligencedata,orsynthesizingevidenceinacourtroom, regulationsmaybecreatedthatrequiretheAItobeguaranteedtoreportits honestassessmentofthefacts.Howeversuchregulationsseemunlikelyinthe nearfuture.Indeed,inChina’s2023draftAIregulationsitwasstipulatedthat AI-generatedcontentmustbe“trueandaccurate”,butthisprovisionwas removedfromthefinalwording,likelybecauselawmakersrealizedthiswasan unachievablebarforcurrentAIsystems(MacCarthy,2023). •Usefulness:ImprovingAIhonestywouldsignificantlyenhancetheutilityand reliabilityofAIsystems.Reducinghallucinationsandfalseoutputswouldmake AImoredependableforcriticaltasks,potentiallyopeningupnewmarketsand applications(Zhangetal.,2023).Techniqueslikeelicitinglatentknowledge couldimproveAIperformancebyaccessinginformationthemodelhaslearned butdoesn'ttypicallyoutput.Thatsaid,currentworkonelicitinglatent knowledgeisoftenverytheoreticalandpre-paradigmatic(Christiano,Xu,and Cotra,2021)whichmakesitlesslikelytobeattractivetomostAIcompanies. Safetybydesign(2papers) AdvancedAIsystemstodayaregenerallytrainedusingthedeep-learningparadigm(Bengio, 2024,p.18–19). 26 Correspondingly,manyofthecategoriesinourpaperfocusonthesafe developmentofdeeplearningAIsystems.The“safetybydesign”categorytakesadifferent approach:focusingondevelopingalternativeparadigmsforAIdevelopmentthatmightbe fundamentallysaferthandeeplearning. ToknowthatadeeplearningAIsystemissafetodeploy,wecurrentlyrelyheavilyonan approachofsearchingfor,butfailingtofind,dangerousmodelbehavior(seeSafetyevaluations; Clymeretal.,2024).Thisisaproblematicstatusquobecauseitisdifficulttoinferfromalimited setofobservationsduringsafetyevaluationsthatanAIsystemwouldbesafeacrossthegreat diversityofsituationsthatitmayencounterpost-deployment(Mukobi,2024;Dalrympleetal., 26 Deeplearningisamachinelearningtechniquewheretherearemultiplelayersofinterconnectednodes,somewhat analogoustoahumanbrain.Dataentersthis“neuralnetwork”andistransformedasitpassesthroughthevarious layers,eventuallyproducinganoutput,suchasapredictionorapieceoftext.Duringthetrainingprocess,multiple layersofinterconnectednodesareadjusteduntilthis“neuralnetwork”canperformwellataparticularobjective, suchasproducingaparticularkindoftext. MappingTechnicalSafetyResearchatAICompanies |21 2024).Morespeculatively,failuretoelicitconcerningbehaviorfromanAIsystemwouldbeeven weakerevidenceofsafetyiffutureadvancedAIscould“scheme”byactingbenignlyinorderto passsafetyevaluations,butthenactdangerouslypost-deployment(Carlsmith,2023). 27 Ideally, wewouldbuildAIsystemswherewehavestrongertheoreticalreasonstoexpectsafety,or whereitisfundamentallyeasiertoverifysafety.Severalalternativestodeeplearninghavebeen proposedaswaysofpotentiallyachievingthis. 1.GuaranteedSafeAIaimstobuildAIsystemswithformalproofsthattheyarebelowa certainriskthreshold(InternationalDialoguesonAISafety,2024;Critch&Krueger, 2020,p.43-44).Thisishardtoachieveinthecurrentdeeplearningparadigmbecause wedonotyethaveathoroughmechanisticunderstandingofhowAIsystemsreason, withwhichtoconstructaguarantee(seeMechanisticinterpretability;Hassija,2024). GuaranteedSafeAIseekstorigorouslyspecifydesiredsafetyoutcomes,find mathematicalproofsdemonstratingwhichAIsystemshavethesesafetyfeatures,and trainsuchAIsystemswhilestillmaintainingadvancedcapabilities(Dalrymple,2024,p. 3;Dalrympleetal.,2024). 28 2.Agentfoundationsisabodyofresearchthatattemptstorigorouslyformalize importantconceptslike'intelligence','agency',and'optimization',whicharecurrently fairlyfuzzyandundertheorized(Soares&Fallenstein,2017;Garrabrant,Herrmann,and Lopez-Wild,2021;Garrabrantetal.,2020;Kentonetal.,2022).Akeymotivationofthis workisthatunderstandingtheseconceptsmorerigorouslymightenabledesigningAI systemsinamoreprincipledwaythanispossiblewithdeeplearning(Yudkowsky, 2018).ThiscouldallowforgreaterconfidencethatadvancedAIsystemswillbehaveas intendedeveninnovel,high-stakescontextswhereiterativetestingandempirical feedbackareinfeasible. 29 Forinstance,recentworkrelevanttothiscategoryhas attemptedtomakeAIagentsthatrespectspecificconstraints,ratherthanpursuinga givengoalmaximallyefficiently(Farquhar,Carey,andEveritt,2022). 3.Wholebrainemulation(WBE)aimstocreateintelligentsystemscloselymodeledon thehumanbrain.Arguably,WBE-basedAIwouldbesaferthanAIsystemsdevelopedin thecurrentparadigm:bydefinition,WBEsystemswouldhavecloserparallelswith humancognition.Thismightmakethemmorelikelytohavedesirablevalues,intentions, andreasoningstyles(Duettmannetal.,2023). 30 30 Greatcautionmightbeneededifpursuingthisresearchcategory.Forexample,succeedingatcreatingAIsystems thataresimilartothehumanbrainmighthavemassiveethicalimplications,suchasifthesesystemsbecome conscious(Mandelbaum,2022). 29 TheMachineIntelligenceResearchInstitute,historicallyoneofthekeyproponentsofagentfoundationsresearch, hasrecentlymovedawayfromtheareaduetobecoming“increasinglypessimisticaboutthistypeofworkyielding significantresultswithintherelevanttimeframes”(Stewart,2024). 28 Somereviewerssuggestedthatthisisaparticularlyambitiousresearchtopic.WealsonotethatGuaranteedSafe AIisbackedby£59mfromtheUKgovernment’sAdvancedResearch+InventionAgency(ARIA,2024)viathe “SafeguardedAI”program.SoevenifAIcompaniesdonotpursueresearchintoGuaranteedSafeAI,thisareawill receivealotofattention. 27 ThiswouldbeanalogoustoVolkswagen“scheming”toevadeenvironmentalregulations,byproducingfewer emissionswhenitsvehicleswereinenvironmentalteststhanwhentheywerebeingdriveninregular ‘post-deployment’settingsontheroad(Schiermeier,2015). MappingTechnicalSafetyResearchatAICompanies |22 Corporateincentives •Reputation:Theseapproachesarespeculativelonger-termbets,soreceive lesspublicattention. •Regulation:Thissortofhigh-risk,high-rewardfoundationalresearchisnot amenabletobeingmandatedbyregulation.Ifpromisingresultsemergeinone ormoreoftheseareas,governmentsmayeventuallyrequirethatadvancedAI systemsbebuiltinasafe-by-designway. •Usefulness:Thisresearchseemsunlikelytomakecurrentandnear-futureAI systemsmoreuseful,atleastwhilecompute-drivenscalingandexisting techniquescontinuetoworkwell(Hendrycks&Mazeika,2022,p.35–36). Indeed,demonstrablysafeAIsystemsmaybesimplerandlesscapablethan otherAIsystems,atleastinitially. 31 Unlearning(2papers) Largelanguagemodelsareverygeneralintheirknowledgebasebecausetheyaretrainedon suchabroadsetoftrainingdata.However,theremaybetopics,suchasdetailedbiology knowledgerelevanttobioweapondesign,wherewewanttolimittheirknowledgeduetoriskof misuse. UnlearningaimstoreduceanAImodel’sknowledgeinspecificdangerousdomains.Arecent approachinvolvedisolatingtheneuralrepresentationsofparticulardangerousconceptsinthe model,andthendeliberatelyperturbingthosemodelweights,whilemaintainingthe representationsofrelatedharmlessconcepts(Lietal.,2024,p.8–9). 32 Thismethodwasused tosuccessfullyreducemodelperformancetonearrandomchanceonatestsetofquestions aboutdangerousbiologicalorcybersecurityinformation,whileincurringminimallosseson traditionalcapabilitybenchmarks(Lietal.,2024,p.10–13).Unlearningcouldpotentiallyalso beusedtoworsenmodels’“situationalawareness”regardingdetailsabouttheirtraining process,architecture,andhumanoversightprocess.Thiscouldbeusedtomakeitharderfor AImodelswithpower-seekingtendenciestoplanactionsthattheirhumansupervisorswould notwant(Berglundetal.,2023). 32 TraininganAImodelonanarrowersetofdatathatexcludesreferencetoanydangerousinformationislikelytobe lesseffective,bothbecausetheAImaybeabletoinferthedangerousinformationfromcluesintextitwastrained on,andthegeneralmodelcapabilitiesmaybeimpairediftoomuchtrainingdataisremoved. 31 Thecompensatorysafetybenefitsmaywellbesufficienttomeanthelesscapablebutsafersystemsareactually moreincentivizedtobedeployed(Dalrymple,2024). MappingTechnicalSafetyResearchatAICompanies |23 Researchinthiscategoryoriginallygainedtractioninresponsetoprivacyconcerns,suchasthe “righttoerasure”intheEU’sGeneralDataProtectionRegulation(Shumailovetal.,2024).The focushereisonremovingadifferentkindofinformation,i.e.,informationaboutspecific individualswhosedatawasusedtotrainthemodel,ratherthaninformationaboutbroader topics,suchasbioweapons(Juliussen,Rui,andJohansen,2023;Lietal.,2024,p.4). 33 GoogleDeepMindrecentlyranacompetitionfornovelunlearningtechniqueswhichgarnered submissionsfrommorethanathousandteamsworldwide,suggestingsignificantongoing interestinthisresearchcategory(Triantafillouetal.,2024). Corporateincentives •Reputation:Unlearningmightbehelpfulforaddressingarangeofpossible failuresfromAIsystems,includingonesthatcouldbeveryreputationally damagingeveniftheydonotposelarge-scaleaccidentandmisuserisks, suchasleakinginformationaboutindividuals. •Regulation:IfAIcompaniesarerequiredtomake‘safetycases’fortheir products(Clymeretal.,2024),showingthatparticulardual-useexpertise areashavebeenunlearnedcouldbequitepersuasive.Moreover,theEU GeneralDataProtectionRegulationcontainsa‘righttobeforgotten’,so unlearningtechniquescouldbekeytoremovingsomeindividual’strainingdata fromanAIcorpuswithoutneedingtoretraintheentiremodel(Juliussen,Rui, andJohansen,2023). •Usefulness:WedonotexpectunlearningtomakeAIsystemsmoreuseful, otherthaninthesensethatAIsystemsthatarelessriskycanbedeployedina widerrangeofcontexts. Modelorganismsofmisalignment(1paper) SomedangerouspossiblecharacteristicsofAIsystemsarehardtostudydirectlyinthemost capablecurrentAIsystems.Forexample,someconcerningpropertiesmightemergeonlyin modelsthataremorecapablethananythatexisttoday(Ngo,Chan,andMindermann,2024).In biology,researchersoftenstudymodelorganisms(suchasmice)whenexperimentingonthe organismofinterest(suchashumans)istoodifficultorrisky.Analogously,researcherscould studyrelativelysimpleAIsystemsthathavebeendeliberatelybuilttodemonstrateexamplesof thecharacteristicsthatmightemergeinmorecomplicatedsystems.Thiswouldmakeiteasier 33 AIcompanieshavealsopublishedworkonthismoreoriginalkindofunlearning.See,forexample,Hayesetal. (2024).Althoughvaluable,suchworkislikelylessrelevanttolarge-scalerisksinparticular,andsoisnotcounted here. MappingTechnicalSafetyResearchatAICompanies |24 toempiricallytesthypothesesabouthowadvancedAIsystemsmightbehave,andabout whetherparticularsafetytechniquesmightreducetheassociatedrisks.Forinstance,recent workhasshownthatAImodelstrainedtobedeceptivewillgenerallyremaindeceptiveeven oncesafetytechniquesareapplied(Hubingeretal.,2024).Havingclearempirical demonstrationswouldalsobeusefulforimprovingnon-experts’understandingofconcerning propertiesthatAIsystemsmighthave,suchasdeceptiveness. Corporateincentives •Reputation:Researchonmodelorganismsofmisalignmentisunlikelytoyield significantreputationalbenefitsforAIcompaniesinthenearterm.Unlikemore visiblesafetymeasures(e.g.,contentfiltering),modelorganismsresearch doesn'tdirectlyaddressimmediateconcernsthatcustomersorthemedia typicallyfocuson.Moreover,publicizingworkondeliberatelymisaligned systemscouldpotentiallybackfire,raisingalarmabouttherisksofAIwithout adequatelyconveyingthepreventativenatureoftheresearch. •Regulation:AIcompaniesmightalsobenervousaboutmodelorganisms researchbecausethisresearchcouldincreasethelikelihoodofnewregulation, suchasbyprovidingmorereliableevidenceofconcerningpropertiesinAI systems,suchasdeceptiveness.Additionally,AIcompaniesmayworrythat modelorganismsresearchcouldbringliabilityconcerns.Ifacompany demonstrablyknowsaboutwaysinwhichitsproductsmightbeunsafe,such asbecauseofmodelorganismsresearch,thecompanymaybemorelikelyto befoundlegallynegligentforanysafetyfailuresthatoccur(Schwartz,1998). •Usefulness:Modelorganismsaresimplerthanthestateoftheartsotheyare easiertostudy.Thiswouldmakethefindingslessusefulfortopicsthat researchersarenotspecificallyaimingtostudywiththemodelorganism,such ashowtomakeAIsystemsmoreuseful. Multi-agentsafety(0papers) Multi-agentsafetyfocusesonrisksthatmightemergeduringinteractionsbetweenAIsystems (Anwaretal.,2024,p.35–38;Hammond,2023). 34 Thiscontrastswiththeotherareasinthis paper,whichareprimarilyframedintermsofindividualAIsystems. RisksfrominteractionsbetweenmultipleAIsystemsinclude: 34 Similartopicsaresometimesstudiedunderthelabelof“cooperativeAI”.Theinitialresearchagendaframedthis areaascoveringcooperationbetweenAIsystems,aswellascooperationbetweenAIandpeople,andAI-enabled cooperationbetweenpeople(Dafoeetal.,2020). MappingTechnicalSafetyResearchatAICompanies |25 ● Failuresofcooperation:AIsystemsmightbeinsituationswheretheywouldideally cooperatebutfailtodoso.Forexample,iftherearemultipleself-drivingcarsonthe road,cooperationfailuresmightcauseacollision(Dafoeetal.,2020,p.4–5). ● Collusion:CooperationbetweenAIsystemswouldnotalwaysbedesirable.For example,collusionbetweensellersinamarketisbadforconsumerwelfare.Thereare alreadysomesignsofthiskindofcollusionbetweenAIsystems,suchasapparent collusionbetweencomparativelysimplealgorithmsusedtosetfuelprices(Anwaretal., 2024,p.36).Collusionmightbeparticularlyconcerningbecausesomeplansfor ensuringthesafetyofAIsystemsinvolveputtingAImodelsinadversarialrelationshipsto eachother,suchasbyusingonemodeltooverseeanothermodel. 35 Collusioninsuch caseswouldunderminetheseplans(Hammond,2023). ● Emergentbehavior:TheinteractionsbetweendifferentAIsystemsmaycreatevery harmfuloveralleffectsthatnoneofthesystemsinvolvedintended.Anexampleofthis withrelativelysimpleautomatedsystemsisthe2010“flashcrash”;hard-to-predict interactionsbetweendifferenttradingalgorithmscontributedtothestockmarketbriefly losingroughlyatrilliondollarsinvalue(Hammond,2023). TheseproblemsarenotspecifictoAI;cooperationandcollusionareimportanttopicsbetween otherkindsofagents,suchaspeopleororganizations.Thatsaid,AIsystemshaveseveral distinctivecharacteristicsthatmightmakethestudyofinteractionsbetweenAIsystems differentfromthestudyofinteractionbetweenotheragents.Forexample,AIsystemscanbe perfectlycopied,creatinganunusualsituationwhereagentsmightinteractwith“themselves” (Conitzer&Oesterheld,2023). 36 Noneofthepapersinourdatasetwerebestcharacterizedasmulti-agentsafety.However,this underrepresentshowmuchworkAIcompaniesaredoingonmulti-agentrisks.DeepMindhas alreadypublishedsomerelevantwork,eventhoughitdoesnotmeetoursearchcriteria. 37 This includesdeveloping“MeltingPot”,anevaluationsuitefortendenciessuchascooperationin multi-agentscenarios(Leiboetal.,2021;Agapiouetal.,2023).DeepMindemployeeshavealso contributedtoresearchthatisrelevanttomulti-agentsafety,suchasonnormsand cooperationinmulti-agentsettings(Duetal.,2023;Vinitskyetal.,2023).Indeed,DeepMind hasa“GameTheoryandMulti-Agent”team.Althoughtheteamdoesnotdescribeitselfin termsofsafetyorlarge-scalerisks,someofitsworkislikelyrelevanttothesetopics(Gempet al.,2022). 37 Thepaperscitedinthisparagraphdonotmakeitintoourdatasetforreasonssuchasthefirstauthornotbeingat DeepMind,thepapersnotbeinglinkedtoonDeepMind’swebsite,orthetitlessuchas“MeltingPot2.0”not containingkeywordsthatweusedduringoursearchprocess. 36 Additionally,AIsystemsmightbeabletorevealtheircodetoeachother.Thiscouldaffectthedynamicsoftheir interactions,suchasbymakingiteasierforAIsystemstomakecrediblecommitments. 35 Examplesinclude“safetybydebate”(seeEnhancinghumanfeedback;Irving,Christiano,andAmodei,2018), adversarialtraining(seeRobustness;Ziegleretal.,2022),andthecontrolagenda(seeControllinguntrustedAIs; Greenblattetal.,2024). MappingTechnicalSafetyResearchatAICompanies |26 Additionally,seniorfiguresatAnthropicandGoogleDeepMindareontheboardofthe CooperativeAIFoundation(2023),anon-profitthatsupportsresearchrelevanttomulti-agent safety. 38 ThisindicatesthattheleadershipofAIcompaniesaremoreinterestedinmulti-agent risksthanpublicationrecordsmayimply. Corporateincentives •Reputation:SafetyfailuresfromanindividualAIsystemcanbeclearlytraced backtothatsystem’sdeveloper,makingitrelativelysimpletoholdthat developeraccountable.Incontrast,itmightnotbeobviouswhotoblamefora failurethatemergedfromtheinteractionofseveraldifferentAIsystems, makingaccountabilityharder. •Regulation:Likewise,theunclearresponsibilityformulti-agentfailuresmay makeitespeciallydifficulttoconstructeffectiveregulationstogovernthese cases,reducingincentivesforcompaniestoresearchthistopic. 39 •Usefulness:AsAIagentsareincreasinglydeployedinreal-worldsituations, ensuringthatagentscaninteractandcooperatewellwillbevaluable. ControllinguntrustedAIs(0papers) “Control”referstoasetofapproachesthataimtopreventadvancedAImodelsfrombeing dangerous,eveniftheyaremisaligned(Greenblattetal.,2024;Filan,2024).‘Misaligned’here meansthatanAImodelisinclinedtoactinawaythatitsdeveloperdoesnotwant,despite knowingthatthedeveloperdoesnotwantthisbehavior.Proponentsdescribecontrolworkasa complementtoworkthatreducesthelikelihoodofamodelbeingmisalignedinthisway. 40 Greenblattetal.(2024)describeseveralprotocolsasexamplesofcontrol.Oneoftheseaimsto ensurethatahighlycapable—andpotentiallymisaligned—modelcannotincludebackdoorsin 40 Theexamplesthattheauthorsuseareprimarilyaboutmodelsthathavealreadybeendeployed,atleastwithinthe companythathasdevelopedthemodel.However,controlcouldalsoberelevantduringthestagewhereamodelis beingtrained;thisisdescribedinsection5.2andappendixDofthepaper. 39 Additionally,becausethiscategoryfocusesoninteractionsbetweendiffer entAIsystems,workheremightrequire differ entAIdeveloperstoworktogether,suchastotestwhetheragivenapproachwouldcausetheirdiffer ent systemstointeractwitheachotherinasaferway.Ironically,theremaybedifficultieswiththiscooperation,suchasif itraisesantitrustconcerns.Thatsaid,therearevariouscaseswherecompaniesareallowedtocooperatewhenthis increasespublicwelfare,suchasonstandardsetting.Additionally,severalleadingAIcompaniesarealreadypartof theFrontierModelForum,whosepurposeistomakeiteasierforthemtocooperateonsafety(FrontierModelForum, 2024).Otherpossibledifficultiesrelatingtocooperationcouldincludehighercoordinationcostsfromresearchers workingacrossdiffer entorganizations,orreticencetodiscusstradesecretswithcompetitors. 38 Specifically,DarioAmodeifromAnthropicandAllanDafoefromGoogleDeepMind. MappingTechnicalSafetyResearchatAICompanies |27 thecomputercodethatitistaskedwithwriting.Theauthorsproposeusingalesscapable model,thatcanbeassumedtobesafe,toscreenthecodeoutputtedbythemorecapable model,flagginganypossiblebackdoorsforreview. TheGreenblattetal.paperistheonlyoneweareawareofthatisexplicitlyframedintermsof control.Theauthorsnote,however,thattheworkissimilartosomeothertechniquesthatare notconceptualizedintermsofmakingamodelsafeevenifitismisaligned. 41 Forexample,the protocoldescribedaboveistechnicallysimilartoearlierapproacheswheremodelshelp humanstoevaluateothermodels. Corporateincentives •Reputation:Thisresearchhaslimitedpublicityvaluesoprobablymainly affectsacompany’sreputationamongexperts.Speculatively,companies mightworryaboutpublicityfromusingsafetytechniquesthatarepremisedon theideathattheirmodelshaveintentionsthatthecompanydoesnotwant themtohave. •Regulation:Wearenotawareofanyregulationorplansforregulationthatare specificallyrelevanttocontrol. •Usefulness:Thisworkwouldnotmakesystemsmoreuseful,otherthanby expandingpossibilitiestodeployAIsystems,suchasbypotentiallymakingit feasibletodeployAIsystemseveniftheyaremisaligned. 41 SeeinparticularsectionfouroftheGreenblattetal.paper,andthe“WhatisAIcontrol?”sectionofFilan’sinterview withShlegerisandGreenblatt. MappingTechnicalSafetyResearchatAICompanies |28 FutureProspects InthispaperwehaveshownwhatresearchleadingAIcompanieshavepublishedrelatingto safeAIdevelopment(seeFigure1).Usingouranalysisofincentives,wecanmakesome tentativepredictionsabouthowthisworkmaychangeoverthenextfewyears. TherearesomecategoriesthatcurrentlyhavefewpapersandwherewedonotexpectAI companies’incentivestostronglyshiftmoretowardsresearchingthem: ●Modelorganismsofmisalignment.Weexpectthatthepotentialreputationalrisks andlackofimmediateeconomicbenefitswillcontinuetomakethiscategory unattractiveforAIcompanies.Thatsaid,modelorganismsresearchwaspioneeredby Anthropicresearchers;Anthropicmaycontinuetoresearchthiscategoryevenifit seemstobeagainsttheincentivesofAIcompaniesingeneral. ● Multi-agentsafety.ResearchinthiscategorywillbecomemorerelevanttoAI companiesbecausetheyareincreasinglydevelopingAIsystemsthatcanact autonomouslyintheworld(Chanetal.,2024);suchsystemswouldpresumablyneedto beabletointeractsafelywithotherAIsystems.DeepMindinparticularisalreadydoing workrelevanttothistopic,eventhoughthiscannotbeseeninourdataset.Thatsaid, someofthebarriersthatweidentifiedtoAIcompaniesdoingresearchonmulti-agent safety,suchasunclearattributionformulti-agentsafetyfailures,mightremain. ● Safetybydesign.Thelong-term,foundationalnatureofthisresearchcategorymaybe atoddswithshortertermprioritiesandbusinessmodelsofAIcompanies.Movingaway fromthedeeplearningparadigmwouldalsobedifficultgiventhelackofa well-researchedalternative,eveninacademicliterature. ThereareseveralcategoriesthatcurrentlyhavefewpapersbutwhereweexpectAIcompanies tosomewhatincreasetheirresearchefforts.Inparticular: ●Control.IfresearchersoutsideofAIcompaniessucceedindemonstratingthatthisisa promisingapproachforreducinglarge-scalerisks,thenAIcompanieswouldbemore likelytoresearchit;wedidnotidentifyparticularreasonswhyitwouldbeincompatible withAIcompanies’incentives. ● HonestAI.ThisareawouldbeextremelyvaluabletoAIcompaniesevenjustfroma perspectiveofmakingtheirAIsystemsmoreuseful.Assuch,weexpectAIcompanies topursuethisareamore,iftechnicalresultsarepromising.(Thatsaid,theymightnot publishthisresearch,giventhegoodreasonsthatAIcompaniessometimeshavenotto publishresearchthatcanbeusedtomakeAIsystemsmoreuseful.) ● Power-seekingtendencies.AIsystemsarecurrentlynotcapableenoughtoengage inparticularlyconcerningpower-seekingbehavior.IfAIsystemsdobecomemore capableandengageinthisbehavior,thenAIcompanieswouldbemoreimmediately incentivizedtoresearchthiscategory.IfAIsystemsexhibitpower-seekingtendencies, makingprogressonempiricalresearchinthiscategorywillbecomeeasier. MappingTechnicalSafetyResearchatAICompanies |29 ● Unlearning.Althoughresearcheffortsfocusingonunlearningforlarge-scalerisksare fairlynew,therehavebeenmultiplepublicationsaboutthisrecentlyfromDeepMind. 42 ThisareaiscompatiblewiththeincentivesofAIcompanies. AI-assistedsafetyresearch:InthesamewaythatAIwillbecomeincreasinglyusefulinmany sectorsoftheeconomy,someoftheresearchcategoriesdiscussedinthispaperwilllikely benefitfromAIassistance.ExistingAIsystemsexhibitcapabilitiesthatcouldberelevantforAI research,suchasadvancedmathematicalreasoningskills(GoogleDeepMind,2024).However, currentattemptstoautomatethescientificdiscoveryprocesshavehadlimitedsuccess,so thereisstillalongwaytogobeforeresearchinmostdomainscanbefully‘automated’(Luet al.,2024).TherehasbeensomeinitialprogressinAI-assistedmechanisticinterpretability,with OpenAI’snow-disbanded“Superalignment”team(Leike&Sutskever,2023)harnessingGPT-4 towritehuman-interpretableexplanationsofthefunctionofindividualneuronsinGPT-2(Billset al.,2023). 43 WeexpectthatovertimesignificantlymoresafeAIdevelopmentresearchwillbe AI-assisted. 44 Forthecategoriesthatcurrentlyhavemorepapers—enhancinghumanfeedback, mechanisticinterpretability,robustness,andsafetyevaluations—wemostlydidnot identifyincentivesthatwouldcausethesecategoriestobecomemuchlessresearchedin future.Giventhefast-changingnatureofsafeAIdevelopmentresearch,wemaysoonsee whichnewapproachesbecomepromising,andwhichexistingideasbecomeobsolete. 44 AIcompaniesareespeciallywellplacedtoleverageAIassistanceintheirsafetywork,giventheywilloftenhave accesstomoreadvancedinternalAImodelsyettobereleasedtothepublic,andintegratingAIintoworkflowsaligns withthesecompanies’corecompetencies. 43 TheSuperalignmentteamwasrecentlydisbanded,makingitunclearwhethersuchworkwillcontinueatOpenAI (Knight,2024),thoughJanLeike(whoco-ledtheteam)hassinceannouncedthathewillcontinueworkingon AI-assistedsafetyresearchatAnthropic(Metz,2024). 42 Specifically,Triantafillouetal.(2024)andShumailovetal.(2024). MappingTechnicalSafetyResearchatAICompanies |30 Acknowledgements ThankyoutoarXivfortheuseofitsopenaccessinteroperability.Forhelpfulfeedbackand suggestions,wewouldliketothankOnniAarne,AdamGleave,ErichGrunewald,John Halstead,DanHendrycks,RubiHudson,VinayHiremath,ChrisLeong,DavidManheim,Aidan O’Gara,TilmanRäuker,JesseRichardson,BuckShlegeris,SaadSiddiqui,PeterWildeford, SarahWeiler,JueYanZhang,andothers.ThankyoutoShaanShaikhforcopyediting.These peopledonotnecessarilyagreewiththeclaimsinthisreportandallmistakesareourown. MappingTechnicalSafetyResearchatAICompanies |31 Bibliography AdvancedResearch+InventionAgency.2024.“SafeguardedAI.” https://w.aria.org.uk/programme-safeguarded-ai/. Agapiou,JohnP.,AlexanderSashaVezhnevets,EdgarA.Duéñez-Guzmán,JaydMatyas,Yiran Mao,PeterSunehag,RaphaelKöster,etal.2023.“MeltingPot2.0.”arXiv. https://doi.org/10.48550/arXiv.2211.13746. Albergotti,Reed.2024.“TechCompaniesGoDarkaboutAIAdvances.That’saProblemfor Innovation.”Semafor.February16,2024. https://w.semafor.com/article/02/16/2024/tech-companies-go-dark-about-ai-advance s. Amodei,Dario,ChrisOlah,JacobSteinhardt,PaulChristiano,JohnSchulman,andDanMané. 2016.“ConcreteProblemsinAISafety.”arXiv.https://doi.org/10.48550/arXiv.1606.06565. Anthropic.2023.“Anthropic'sResponsibleScalingPolicy.”September19,2023. https://w.anthropic.com/news/anthropics-responsible-scaling-policy. Anwar,Usman,AbulhairSaparov,JavierRando,DanielPaleka,MilesTurpin,PeterHase, EkdeepSinghLubana,etal.2024.“FoundationalChallengesinAssuringAlignmentand SafetyofLargeLanguageModels.”arXiv.https://doi.org/10.48550/arXiv.2404.09932. Armstrong,Stuart,AlexandreMaranhão,OliverDaniels-Koch,PatrickLeask,andRebecca Gorman.2023.“CoinRun:SolvingGoalMisgeneralisation.”arXiv. https://doi.org/10.48550/arXiv.2309.16166. Bai,Yuntao,SauravKadavath,SandipanKundu,AmandaAskell,JacksonKernion,Andy Jones,AnnaChen,etal.2022.“ConstitutionalAI:HarmlessnessfromAIFeedback.”arXiv. https://doi.org/10.48550/arXiv.2212.08073. Barnes,Tom.2024.“NavigatingRisksfromAdvancedArtificialIntelligence:AGuidefor Philanthropists.”FoundersPledge. https://w.founderspledge.com/research/research-and-recommendations-advanced-art ificial-intelligence. Bengio,Yoshua.2024.“InternationalScientificReportontheSafetyofAdvancedAI.” DepartmentofScience,InnovationandTechnology. https://w.gov.uk/government/publications/international-scientific-report-on-the-safety-o f-advanced-ai. Bengio,Yoshua,GeoffreyHinton,AndrewYao,DawnSong,PieterAbbeel,TrevorDarrell,Yuval NoahHarari,etal.2024.“ManagingextremeAIrisksamidrapidprogress.”Science. 384:6698(842-845).https://doi.org/10.1126/science.adn0117. MappingTechnicalSafetyResearchatAICompanies |32 Bereska,Leonard,andEfstratiosGavves.2024.“MechanisticInterpretabilityforAISafety--A Review.”arXiv.https://doi.org/10.48550/arXiv.2404.14082. Berglund,Lukas,AsaCooperStickland,MikitaBalesni,MaxKaufmann,MegTong,Tomasz Korbak,DanielKokotajlo,andOwainEvans.2023.“TakenoutofContext:OnMeasuring SituationalAwarenessinLLMs.”arXiv.https://doi.org/10.48550/arXiv.2309.00667. Bills,Steven,NickCammarata,DanMossing,HankTillman,LeoGao,GabrielGoh,Ilya Sutskever,JanLeike,JeffWu,andWilliamSaunders.2023.“LanguageModelsCan ExplainNeuronsinLanguageModels.”OpenAI. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html. Birhane,Abeba,RyanSteed,VictorOjewale,BrianaVecchione,andInioluwaDeborahRaji. 2024.“AIAuditing:TheBrokenBusontheRoadtoAIAccountability.”arXiv. https://doi.org/10.48550/arXiv.2401.14462. Bostrom,Nick.2014.Superintelligence:Paths,Dangers,Strategies.Firstedition.Oxford, England:OxfordUniversityPress. Bowman,SamuelR.,JeeyoonHyun,EthanPerez,EdwinChen,CraigPettit,ScottHeiner, KamilėLukošiūtė,etal.2022.“MeasuringProgressonScalableOversightforLarge LanguageModels.”arXiv.https://doi.org/10.48550/arXiv.2211.03540. Brady,James.2024.“DiscoveringAlignmentWindfallsReducesAIRisk.”Elicit.February20, 2024.https://blog.elicit.com/alignment-windfalls/. Buolamwini,JoyandTimnitGebru.2018.“GenderShades:IntersectionalAccuracyDisparities inCommercialGenderClassification.”ConferenceonFairness,Accountabilityand Transparency.https://proceedings.mlr.press/v81/buolamwini18a.html. Burns,Collin,HaotianYe,DanKlein,andJacobSteinhardt.2022.“DiscoveringLatent KnowledgeinLanguageModelsWithoutSupervision.”arXiv. https://doi.org/10.48550/arXiv.2212.03827. Burns,Collin,PavelIzmailov,JanHendrikKirchner,BowenBaker,LeoGao,Leopold Aschenbrenner,YiningChen,etal.2023.“Weak-to-StrongGeneralization:ElicitingStrong CapabilitiesWithWeakSupervision.”arXiv.https://doi.org/10.48550/arXiv.2312.09390. Carlsmith,Joseph.2022.“IsPower-SeekingAIanExistentialRisk?”arXiv. https://doi.org/10.48550/arXiv.2206.13353. Carlsmith,Joseph.2023.“SchemingAIs:WillAIsfakealignmentduringtraininginordertoget power?”arXiv.https://doi.org/10.48550/arXiv.2311.08379. Casper,Stephen,XanderDavies,ClaudiaShi,ThomasKrendlGilbert,JérémyScheurer,Javier Rando,RachelFreedman,etal.2023.“OpenProblemsandFundamentalLimitationsof MappingTechnicalSafetyResearchatAICompanies |33 ReinforcementLearningfromHumanFeedback.”arXiv. https://doi.org/10.48550/arXiv.2307.15217. CenterforAISafety.2023.“StatementonAIRisk.”May2023. https://w.safe.ai/statement-on-ai-risk. Chan,Alan,RebeccaSalganik,AlvaMarkelius,ChrisPang,NitarshanRajkumar,Dmitrii Krasheninnikov,LauroLangosco,etal.2023.“HarmsfromIncreasinglyAgentic AlgorithmicSystems.”Proceedingsofthe2023ACMConferenceonFairness, Accountability,andTransparency.https://doi.org/10.1145/3593013.3594033. Chan,Alan,CarsonEzell,MaxKaufmann,KevinWei,LewisHammond,HerbieBradley,Emma Bluemke,etal.2024.“VisibilityintoAIAgents.”arXiv. https://doi.org/10.48550/arXiv.2401.13138. Chiang,Wei-Lin,LianminZheng,YingSheng,AnastasiosNikolasAngelopoulos,TianleLi, DachengLi,HaoZhang,etal.2024.“ChatbotArena:AnOpenPlatformforEvaluating LLMsbyHumanPreference.”arXiv.https://doi.org/10.48550/arXiv.2403.04132. Christiano,Paul,JanLeike,TomBrown,MiljanMartic,ShaneLegg,andDarioAmodei.2017. “DeepReinforcementLearningfromHumanPreferences.”InAdvancesinNeural InformationProcessingSystems.Vol.30.CurranAssociates,Inc. https://papers.nips.c/paper_files/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e 49-Abstract.html. Christiano,Paul,MarkXu,andAjeyaCotra.2021.“ARC’sFirstTechnicalReport:ElicitingLatent Knowledge.”AlignmentResearchCenter,December. https://w.alignment.org/blog/arcs-first-technical-report-eliciting-latent-knowledge/. Clymer,Joshua,NickGabrieli,DavidKrueger,andThomasLarsen.2024.“SafetyCases:How toJustifytheSafetyofAdvancedAISystems.”arXiv. https://doi.org/10.48550/arXiv.2403.10462. Conitzer,Vincent,andCasparOesterheld.2023.“FoundationsofCooperativeAI.”Proceedings oftheAAAIConferenceonArtificialIntelligence37(13):15359–67. https://doi.org/10.1609/aaai.v37i13.26791. CooperativeAIFoundation.2023.“Foundation.”https://w.cooperativeai.com/foundation. Cotra,Ajeya.2018.“IteratedDistillationandAmplification.”AI-Alignment.March5,2018. https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616. Cottier,Ben.2023.“WhoIsLeadinginAI?AnAnalysisofIndustryAIResearch.”Epoch. November27,2023. https://epochai.org/blog/who-is-leading-in-ai-an-analysis-of-industry-ai-research. MappingTechnicalSafetyResearchatAICompanies |34 Critch,Andrew,andDavidKrueger.2020.“AIResearchConsiderationsforHumanExistential Safety(ARCHES).”arXiv.https://doi.org/10.48550/arXiv.2006.04948. Dafoe,Allan,EdwardHughes,YoramBachrach,TantumCollins,KevinR.McKee,JoelZ.Leibo, KateLarson,andThoreGraepel.2020.“OpenProblemsinCooperativeAI.”arXiv. https://doi.org/10.48550/arXiv.2012.08630. Dalrymple,David.2024.“SafeguardedAI:ConstructingSafetybyDesign.”AdvancedResearch andInventionAgency. https://w.aria.org.uk/wp-content/uploads/2024/01/ARIA-Safeguarded-AI-Programme- Thesis-V1.pdf. Dalrymple,David,JoarSkalse,YoshuaBengio,StuartRussell,MaxTegmark,SanjitSeshia, SteveOmohundro,etal.2024.“TowardsGuaranteedSafeAI:AFrameworkforEnsuring RobustandReliableAISystems.”arXiv.https://w.doi.org/10.48550/arXiv.2405.06624. Davidson,Tom,Jean-StanislasDenain,PabloVillalobos,andGuillemBas.2023.“AIcapabilities canbesignificantlyimprovedwithoutexpensiveretraining.”arXiv. https://w.doi.org/10.48550/arXiv.2312.07413. Denison,Carson,MonteMacDiarmid,FazlBarez,DavidDuvenaud,ShaunaKravec,Samuel Marks,NicholasSchiefer,etal.2024.“SycophancytoSubterfuge:Investigating Reward-TamperinginLargeLanguageModels.”arXiv. https://doi.org/10.48550/arXiv.2406.10162. DepartmentofHomelandSecurity.2024.“MitigatingArtificialIntelligence(AI)Risk:Safetyand SecurityGuidelinesforCriticalInfrastructureOwnersandOperators.”UnitedStates Government. https://w.dhs.gov/publication/safety-and-security-guidelines-critical-infrastructure-own ers-and-operators. DepartmentforScience,Innovation&Technology.2023a.“EmergingProcessesforFrontierAI Safety.”GOV.UK.October27,2023. https://w.gov.uk/government/publications/emerging-processes-for-frontier-ai-safety/e merging-processes-for-frontier-ai-safety. —.2023b.“InternationalSurveyofPublicOpiniononAISafety.”GOV.UK.October27, 2023. https://w.gov.uk/government/publications/international-survey-of-public-opinion-on-ai- safety. Du,Yali,JoelZ.Leibo,UsmanIslam,RichardWillis,andPeterSunehag.2023.“AReviewof CooperationinMulti-AgentLearning.”arXiv.https://doi.org/10.48550/arXiv.2312.05162. MappingTechnicalSafetyResearchatAICompanies |35 Duettmann,Allison,AndersSandberg,AndyMatuschak,AnitaFolwer,BarryBentley,Bobby Kasthuri,DavidDalrymple,etal.2023.2023WholeBrainEmulationWorkshop.Oxford, UK:ForesightInstitute.http://dx.doi.org/10.13140/RG.2.2.30808.88326. EuropeanParliament.2024“ArtificialIntelligenceAct.” https://w.europarl.europa.eu/doceo/document/TA-9-2024-0138_EN.pdf Farquhar,Sebastian,RyanCarey,andTomEveritt.2022.“Path-SpecificObjectivesforSafer AgentIncentives.”arXiv.https://arxiv.org/abs/2204.10018v1. Farquhar,Sebastian,VikrantVarma,ZacharyKenton,JohannesGasteiger,VladimirMikulik, RohinShah.2023.“ChallengeswithUnsupervisedLLMKnowledgeDiscovery.”arXiv. https://doi.org/10.48550/arXiv.2312.10029. Filan,Daniel.2024.AIControlwithBuckShlegerisandRyanGreenblatt.AXRP. https://axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt .html. FrontierModelForum.2024.“OurPurpose.” https://w.frontiermodelforum.org/how-we-work/. Garrabrant,Scott,TsviBenson-Tilsen,AndrewCritch,NateSoares,andJessicaTaylor.2020. “LogicalInduction.”arXiv.https://doi.org/10.48550/arXiv.1609.03543. Garrabrant,Scott,DanielA.Herrmann,andJosiahLopez-Wild.2021.“CartesianFrames.” arXiv.https://doi.org/10.48550/arXiv.2109.10996. Gemp,Ian,ThomasAnthony,YoramBachrach,AvishkarBhoopchand,KaleshaBullard,Jerome Connor,VibhavariDasagi,etal.2022.“Developing,EvaluatingandScalingLearning AgentsinMulti-AgentEnvironments.”arXiv.https://doi.org/10.48550/arXiv.2209.10958. Gleave,Adam.2023.“AISafetyinaWorldofVulnerableMachineLearningSystems.”FARAI. March5,2023.https://far.ai/post/2023-03-safety-vulnerable-world/. GoogleDeepMind.2024.“AIachievessilver-medalstandardsolvingInternationalMathematical Olympiadproblems.”July31,2024. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/. Greenblatt,Ryan,BuckShlegeris,KshitijSachan,andFabienRoger.2024.“AIControl: ImprovingSafetyDespiteIntentionalSubversion.”arXiv. https://doi.org/10.48550/arXiv.2312.06942. Gruetzemacher,Ross,KyleKilian,IyngkarranKumar,DavidManheim,MarkBailey,Pierre Harter,andJoséHernández-Orallo.2024.“EnvisioningaThrivingEcosystemforTesting& EvaluatingAdvancedAI.”NISTAIEOComment. https://w.nist.gov/system/files/documents/2024/02/15/ID019%20-%202024-02-02%2 MappingTechnicalSafetyResearchatAICompanies |36 0Wichita%20State%20University%20et.%20al%2C%20Comments%20on%20AI%20EO% 20RFI.pdf. Guest,Oliver,MichaelAird,andSeánÓhÉigeartaigh.2023.“SafeguardingtheSafeguards: HowBesttoPromoteAlignmentinthePublicInterest.”InstituteforAIPolicy;Strategy. https://w.iaps.ai/research/safeguarding-the-safeguards. Hadfield-Menell,Dylan,AncaDragan,PieterAbbeel,andStuartRussell.2024.“Cooperative InverseReinforcementLearning.”arXiv.https://doi.org/10.48550/arXiv.1606.03137. Hadshar,Rose.2023.“AReviewoftheEvidenceforExistentialRiskfromAIviaMisaligned Power-Seeking.”arXiv.https://doi.org/10.48550/arXiv.2310.18244. Hammond,Lewis.2023.“Multi-AgentRisksfromAdvancedAI.”InMulti-AgentSecurity Workshop:SecurityasKeytoAISafety.https://neurips.c/virtual/2023/82192. Hassija,Vikas,VinayChamola,AtmeshMahapatra,AbhinandanSingal,DivyanshGoel,Kaizhu Huang,SimoneScardapane,etal.2024.“InterpretingBlack-BoxModels:AReviewon ExplainableArtificialIntelligence.”Cognitivecomputation,45–74. https://doi.org/10.1007/s12559-023-10179-8. Hayes,Jamie,IliaShumailov,EleniTriantafillou,AmrKhalifa,andNicolasPapernot.2024. “InexactUnlearningNeedsMoreCarefulEvaluationstoAvoidaFalseSenseofPrivacy.” arXiv.https://doi.org/10.48550/arXiv.2403.01218. Heikkilä,Melissa.2024.“TheAIActisdone.Here’swhatwill(andwon’t)change.”MIT TechnologyReview.March19,2024. https://w.technologyreview.com/2024/03/19/1089919/the-ai-act-is-done-heres-what- will-and-wont-change/. Hendrycks,Dan,NicholasCarlini,JohnSchulman,andJacobSteinhardt.2022.“Unsolved ProblemsinMLSafety.”arXiv.https://doi.org/10.48550/arXiv.2109.13916. Hendrycks,Dan,andMantasMazeika.2022.“X-RiskAnalysisforAIResearch.”arXiv. https://doi.org/10.48550/arXiv.2206.05862. Hubinger,Evan,CarsonDenison,JesseMu,MikeLambert,MegTong,MonteMacDiarmid, TameraLanham,etal.2024.“SleeperAgents:TrainingDeceptiveLLMsThatPersist ThroughSafetyTraining.”arXiv.https://doi.org/10.48550/arXiv.2401.05566. InformationCommissioner’sOffice,andTheAlanTuringInstitute.2022.“ExplainingDecisions MadewithAI.”ICO. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/ explaining-decisions-made-with-artificial-intelligence. InternationalDialoguesonAISafety.2024.“BeijingStatement.”March11,2024. https://idais.ai/. MappingTechnicalSafetyResearchatAICompanies |37 Irving,Geoffrey,PaulChristiano,andDarioAmodei.2018.“AISafetyviaDebate.”arXiv. https://doi.org/10.48550/arXiv.1805.00899. Ji,Jiaming,TianyiQiu,BoyuanChen,BorongZhang,HantaoLou,KaileWang,YawenDuan,et al.2024.“AIAlignment:AComprehensiveSurvey.”arXiv. https://doi.org/10.48550/arXiv.2310.19852. Juliussen,BjørnAslak,JonPetterRui,andDagJohansen.2023.“AlgorithmsThatForget: MachineUnlearningandtheRighttoErasure.”ComputerLaw&SecurityReview51 (November):105885.https://doi.org/10.1016/j.clsr.2023.105885. Kästner,Lena,andBarnabyCrook.2023.“ExplainingAIThroughMechanisticInterpretability.” PhilSci.https://philsci-archive.pitt.edu/22747/. Kenton,Zachary,RamanaKumar,SebastianFarquhar,JonathanRichens,MattMacDermott, andTomEveritt.2022.“DiscoveringAgents.”arXiv.https://arxiv.org/abs/2208.08345v2. Khlaaf,Heidy.2023.“TowardComprehensiveRiskAssessmentsandAssuranceofAI-Based Systems.”TrailofBits. https://w.trailofbits.com/documents/Toward_comprehensive_risk_assessments.pdf. Kinniment,Megan,LucasJunKobaSato,HaoxingDu,BrianGoodrich,MaxHasin,Lawrence Chan,LukeHaroldMiles,etal.2024.“EvaluatingLanguage-ModelAgentsonRealistic AutonomousTasks.”arXiv.https://doi.org/10.48550/arXiv.2312.11671. Knight,Will.2024.“OpenAI’sLong-TermAIRiskTeamHasDisbanded.”Wired.May17,2024. https://w.wired.com/story/openai-superalignment-team-disbanded/. Krakovna,Victoria,andJanosKramar.2023.“Power-SeekingCanBeProbableandPredictive forTrainedAgents.”arXiv.https://doi.org/10.48550/arXiv.2304.06528. Lanham,Tamera,AnnaChen,AnshRadhakrishnan,BenoitSteiner,CarsonDenison,Danny Hernandez,DustinLi,etal.2023.“MeasuringFaithfulnessinChain-of-Thought Reasoning.”arXiv.https://doi.org/10.48550/arXiv.2307.13702. Leibo,JoelZ.,EdgarDuéñez-Guzmán,AlexanderSashaVezhnevets,JohnP.Agapiou,Peter Sunehag,RaphaelKoster,JaydMatyas,CharlesBeattie,IgorMordatch,andThore Graepel.2021.“ScalableEvaluationofMulti-AgentReinforcementLearningwithMelting Pot.”arXiv.https://doi.org/10.48550/arXiv.2107.06857. Leike,Jan,andIlyaSutskever.2023.“IntroducingSuperalignment.”OpenAI.July5,2023. https://openai.com/blog/introducing-superalignment. Li,Nathaniel,AlexanderPan,AnjaliGopal,SummerYue,DanielBerrios,AliceGatti,JustinD.Li, etal.2024“TheWMDPBenchmark:MeasuringandReducingMaliciousUseWith Unlearning.”arXiv.https://doi.org/10.48550/arXiv.2403.03218. MappingTechnicalSafetyResearchatAICompanies |38 Lin,Stephanie,JacobHilton,andOwainEvans.2022.“TruthfulQA:MeasuringHowModels MimicHumanFalsehoods.”arXiv.https://doi.org/10.48550/arXiv.2109.07958. Lu,Chris,CongLu,RobertTjarkoLange,JakobFoerster,JeffClune,andDavidHa.2024.“The AIScientist:TowardsFullyAutomatedOpen-EndedScientificDiscovery.”arXiv. https://doi.org/10.48550/arXiv.2408.06292. MacCarthy,Mark.2023.“TheUSandItsAlliesShouldEngagewithChinaonAILawand Policy.”Brookings.October19,2023. https://w.brookings.edu/articles/the-us-and-its-allies-should-engage-with-china-on-ai-l aw-and-policy/. Mandelbaum,Eric.2022.“EverythingandMore:TheProspectsofWholeBrainEmulation.” JournalofPhilosophy119(8):444–59.https://doi.org/10.5840/jphil2022119830. Matthews,Dylan.2023.“The$1BillionGambletoEnsureAIDoesn’tDestroyHumanity.”Vox. September25,2023. https://w.vox.com/future-perfect/23794855/anthropic-ai-openai-claude-2. Metz,Cade.2016.“InsideOpenAI,ElonMusk’sWildPlantoSetArtificialIntelligenceFree.” Wired,April. https://w.wired.com/2016/04/openai-elon-musk-sam-altman-plan-to-set-artificial-intelli gence-free/. Metz,Rachel.2024.“Ex-OpenAISafetyLeaderLeiketoJoinRivalAnthropic.”Bloomberg.May 28,2024. https://w.bloomberg.com/news/articles/2024-05-28/ex-openai-safety-leader-leike-to-j oin-rival-anthropic. Mukobi,Gabriel.2024.“ReasonstoDoubttheImpactofAIRiskEvaluations.”arXiv. https://doi.org/10.48550/arXiv.2408.02565. Murgia,Madhumita.2023.“Google’sDeepMind-BrainMerger:TechGiantRegroupsforAI Battle.”FinancialTimes,April. https://w.ft.com/content/f4f73815-6fc2-4016-bd97-4bace459e95e. Nannini,Luca,AgatheBalayn,andAdamLeonSmith.2023.“ExplainabilityinAIPolicies:A CriticalReviewofCommunications,Reports,Regulations,andStandardsintheEU,US, andUK.”arXiv.https://doi.org/10.48550/arXiv.2304.11218. NIST.2024.“StrategicVision.”NIST,May.https://w.nist.gov/aisi/strategic-vision. Ngo,Richard,LawrenceChan,andSörenMindermann.2024.“TheAlignmentProblemfroma DeepLearningPerspective.”arXiv.https://doi.org/10.48550/arXiv.2209.00626. MappingTechnicalSafetyResearchatAICompanies |39 O’Brien,Joe,ShaunEe,andZoeWilliams.2023.“Deploymentcorrections:Anincident responseframeworkforfrontierAImodels.”InstituteforAIPolicyandStrategy. https://w.iaps.ai/research/deployment-corrections. Olah,Chris.2023.“InterpretabilityDreams.”TransformerCircuits.May24,2023. https://transformer-circuits.pub/2023/interpretability-dreams/index.html. OpenAI.2023.“Preparedness.”December18,2023.https://openai.com/safety/preparedness. OpenPhilanthropy.2022.“RequestforProposalsforProjectsinAIAlignmentThatWorkwith DeepLearningSystems.OpenPhilanthropy.”January2022. https://w.openphilanthropy.org/request-for-proposals-for-projects-in-ai-alignment-that- work-with-deep-learning-systems/. Ouyang,Long,JeffWu,XuJiang,DiogoAlmeida,CarrollL.Wainwright,PamelaMishkin, ChongZhang,etal.2022.“TrainingLanguageModelstoFollowInstructionswithHuman Feedback.”arXiv.https://doi.org/10.48550/arXiv.2203.02155. Pacchiardi,Lorenzo,AlexJ.Chan,SörenMindermann,IlanMoscovitz,AlexaY.Pan,YarinGal, OwainEvans,andJanBrauner.2023.“HowtoCatchanAILiar:LieDetectionin Black-BoxLLMsbyAskingUnrelatedQuestions.”arXiv. https://doi.org/10.48550/arXiv.2310.01405. Pan,Alexander,JunShernChan,AndyZou,NathanielLi,StevenBasart,ThomasWoodside, JonathanNg,HanlinZhang,ScottEmmons,andDanHendrycks.2023.“DotheRewards JustifytheMeans?MeasuringTrade-OffsBetweenRewardsandEthicalBehaviorinthe MACHIAVELLIBenchmark.”arXiv.https://doi.org/10.48550/arXiv.2304.03279. Park,PeterS.,SimonGoldstein,AidanO’Gara,MichaelChen,DanHendrycks.2024.“AI deception:Asurveyofexamples,risks,andpotentialsolutions.”Patterns. https://doi.org/10.1016/j.patter.2024.100988. Pauketat,JanetVT,JustinBullock,andJacyReeseAnthis.2023.“PublicOpiniononAISafety: AIMS2023Supplement.”OSF.https://doi.org/10.31234/osf.io/jv9rz. Perez,Ethan,SamRinger,KamilėLukošiūtė,KarinaNguyen,EdwinChen,ScottHeiner,Craig Pettit,etal.2022.“DiscoveringLanguageModelBehaviorswithModel-Written Evaluations.”arXiv.https://doi.org/10.48550/arXiv.2212.09251. Perrigo,Billy.2023a.“DeepMindCEODemisHassabisUrgesCautiononAI.”TIME,January. https://time.com/6246119/demis-hassabis-deepmind-interview/. —.2023b.“Bing’sAIIsThreateningUsers.That’sNoLaughingMatter.”TIME,February. https://time.com/6256529/bing-openai-chatgpt-danger-alignment/. Piper,Kelsey.2019.“StarCraftisadeep,complicatedwarstrategygame.Google’sAlphaStar AIcrushedit.”Vox.January25,2019. MappingTechnicalSafetyResearchatAICompanies |40 https://w.vox.com/future-perfect/2019/1/24/18196177/ai-artificial-intelligence-google- deepmind-starcraft-game. Poli,Michael,StefanoMassaroli,EricNguyen,DanielY.Fu,TriDao,StephenBaccus,Yoshua Bengio,StefanoErmon,andChristopherRé.2023.“HyenaHierarchy:TowardsLarger ConvolutionalLanguageModels.”arXiv.https://doi.org/10.48550/arXiv.2302.10866. Pouget,Hadrien.2024.“WhatwilltheroleofstandardsbeinAIgovernance?”AdaLovelace Institute.April5,2023. https://w.adalovelaceinstitute.org/blog/role-of-standards-in-ai-governance/. Radhakrishnan,Ansh,KarinaNguyen,AnnaChen,CarolChen,CarsonDenison,Danny Hernandez,EsinDurmus,etal.2023.“QuestionDecompositionImprovestheFaithfulness ofModel-GeneratedReasoning.”arXiv.https://doi.org/10.48550/arXiv.2307.11768. Räuker,Tilman,AnsonHo,StephenCasper,andDylanHadfield-Menell.2023.“Toward TransparentAI:ASurveyonInterpretingtheInnerStructuresofDeepNeuralNetworks.”In, 464–83.IEEEComputerSociety.https://doi.org/10.1109/SaTML54575.2023.00039. Reuel,Anka,BenBucknall,StephenCasper,TimFist,LisaSoder,OnniAarne,Lewis Hammond,etal.2024.“OpenProblemsinTechnicalAIGovernance.”arXiv. https://doi.org/10.48550/arXiv.2407.14981. Rudner,TimG.J.,andHelenToner.2024.“KeyConceptsinAISafety:ReliableUncertainty QuantificationinMachineLearning.”CenterforSecurityandEmergingTechnology. https://cset.georgetown.edu/publication/key-concepts-in-ai-safety-reliable-uncertainty-qua ntification-in-machine-learning/. Russell,StuartJ.2019.HumanCompatible:ArtificialIntelligenceandtheProblemofControl. NewYork:Viking. Scheurer,Jérémy,MikitaBalesni,andMariusHobbhahn.2023.“TechnicalReport:Large LanguageModelsCanStrategicallyDeceiveTheirUsersWhenPutUnderPressure.”arXiv. https://doi.org/10.48550/arXiv.2311.07590. Schiermeier,Quirin.2015.“ThesciencebehindtheVolkswagenemissionsscandal.”Nature. September24,2015.https://w.nature.com/articles/nature.2015.18426. Schuett,Jonas,NoemiDreksler,MarkusAnderljung,DavidMcCaffary,LennartHeim,Emma Bluemke,andBenGarfinkel.2023.“TowardsBestPracticesinAGISafetyand Governance:ASurveyofExpertOpinion.”arXiv. https://doi.org/10.48550/arXiv.2305.07153. Schwartz,Victor.1998.“The‘Restatement(third)ofTorts:ProductsLiability’:AGuidetoIts Highlights.”Tort&InsuranceLawJournal.31(1):85-100. http://w.jstor.org/stable/25763264. MappingTechnicalSafetyResearchatAICompanies |41 Sharkey,Lee,ClíodhnaNíGhuidhir,DanBraun,JérémyScheurer,MikitaBalesni,Lucius Bushnaq,CharlotteStix,andMariusHobbhahn.2023.“ACausalFrameworkforAI RegulationandAuditing.”ApolloResearch. https://w.apolloresearch.ai/research/a-causal-framework-for-ai-regulation-and-auditing. Shevlane,Toby,SebastianFarquhar,BenGarfinkel,MaryPhuong,JessWhittlestone,Jade Leung,DanielKokotajlo,etal.2023.“ModelEvaluationforExtremeRisks.”arXiv. https://doi.org/10.48550/arXiv.2305.15324. Soares,Nate,andBenyaFallenstein.2017.“AgentFoundationsforAligningMachine IntelligencewithHumanInterests:ATechnicalResearchAgenda.”InTheTechnological Singularity:ManagingtheJourney,editedbyVictorCallaghan,JamesMiller,Roman Yampolskiy,andStuartArmstrong,103–25.TheFrontiersCollection.Berlin,Heidelberg: Springer.https://doi.org/10.1007/978-3-662-54033-6_5. Stewart,Harlan.2024.“June2024Newsletter.”MachineIntelligenceResearchInstitute.June 14,2024.https://intelligence.org/2024/06/14/june-2024-newsletter/. Templeton,Adly,TomConerly,JonathanMarcus,JackLindsey,TrentonBricken,BrianChen, AdamPearce,etal.2024.“ScalingMonosemanticity:ExtractingInterpretableFeatures fromClaude3Sonnet.”TransformerCircuitsThread. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html. TheEconomist.2014.“Thedozywatchdogs.”December11,2014. https://w.economist.com/briefing/2014/12/11/the-dozy-watchdogs. TheWhiteHouse.2023.“ExecutiveOrderontheSafe,Secure,andTrustworthyDevelopment andUseofArtificialIntelligence.”October30,2023. https://w.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-ord er-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/. Toner,Helen,andAshwinAcharya.2022.“ExploringClustersofResearchinThreeAreasofAI Safety.”CenterforSecurityandEmergingTechnology. https://cset.georgetown.edu/publication/exploring-clusters-of-research-in-three-areas-of-a i-safety/. UKAISafetyInstitute.2024a.“AdvancedAIEvaluationsatAISI:MayUpdate|AISIWork.”May 20,2024.https://w.aisi.gov.uk/work/advanced-ai-evaluations-may-update. —.2024b.“SystemicAISafetyFastGrants.”2024.https://w.aisi.gov.uk/grants. Vinitsky,Eugene,RaphaelKöster,JohnPAgapiou,EdgarADuéñez-Guzmán,AlexanderS Vezhnevets,andJoelZLeibo.2023.“ALearningAgentThatAcquiresSocialNormsfrom PublicSanctionsinDecentralizedMulti-AgentSettings.”CollectiveIntelligence2(2): 26339137231162025.https://doi.org/10.1177/26339137231162025. MappingTechnicalSafetyResearchatAICompanies |42 Yudkowsky,Eliezer.2018.“TheRocketAlignmentProblem.MachineIntelligenceResearch Institute.”October3,2018.https://intelligence.org/2018/10/03/rocket-alignment/. Zhang,Yue,YafuLi,LeyangCui,DengCai,LemaoLiu,TingchenFu,XintingHuang,etal.2023. “Siren’sSongintheAIOcean:ASurveyonHallucinationinLargeLanguageModels.” arXiv.https://doi.org/10.48550/arXiv.2309.01219. Ziegler,DanielM.,SeraphinaNix,LawrenceChan,TimBauman,PeterSchmidt-Nielsen,Tao Lin,AdamScherlis,etal.2022.“AdversarialTrainingforHigh-StakesReliability.”arXiv. https://doi.org/10.48550/arXiv.2205.01663. Zimmermann,RolandS.,ThomasKlein,andWielandBrendel.2024.“ScaleAloneDoesNot ImproveMechanisticInterpretabilityinVisionModels.”AdvancesinNeuralInformation ProcessingSystems36(February). https://proceedings.neurips.c/paper_files/paper/2023/hash/b4aadf04d6fde46346db455 402860708-Abstract-Conference.html. Zou,Andy,LongPhan,SarahChen,JamesCampbell,PhillipGuo,RichardRen,Alexander Pan,etal.2023a.“RepresentationEngineering:ATop-DownApproachtoAI Transparency.”arXiv.https://doi.org/10.48550/arXiv.2310.01405. Zou,Andy,ZifanWang,NicholasCarlini,MiladNasr,J.ZicoKolter,andMattFredrikson.2023b. “UniversalandTransferableAdversarialAttacksonAlignedLanguageModels.”arXiv. https://doi.org/10.48550/arXiv.2307.15043. MappingTechnicalSafetyResearchatAICompanies |43