uqGAn_p2iA.png"><figcaption>The evolution of humans offers a cautionary tale for the development of advanced AI. Evolution selects for traits that maximize <a href="https://en.wikipedia.org/wiki/Inclusive_fitness">inclusive genetic fitness</a> — this process led to human traits that did aim humans towards inclusive genetic fitness in our ancestral environment; now that we have altered our environment considerably, however, these same traits often push us in very different directions, revealing that the goals of humans are typically not deeply aligned with the criteria evolution was selecting for.</figcaption></figure><h2 id="3489">To be clear, the above worries don’t imply that advanced AI wouldn’t be able to “understand” what we really wanted, but instead that this understanding wouldn’t necessarily translate to the AI systems acting in accordance with our wants:</h2><ul><li>By definition, advanced AI would be able to perform tasks that require understanding “fuzzy” aspects of human goals and behavior,<a href="#0deb">¹⁹</a> and thus such AI would likely “understand”<a href="#5787">²⁰</a> ways in which the goals it internalized from training conflicted with the goals its designers intended (e.g., it may recognize that its goals were actually just imperfect proxies for its designers’ goals).</li><li>If an AI recognizes this discrepancy between its internalized goals and its designers’ intended goals, however, that does not automatically cause the discrepancy to disappear. 
Continuing the evolution analogy, when a human learns that their love of artificially-flavored food is due to evolutionary pressure for nutritious food, the human doesn’t suddenly start desiring nutritious food instead; on the contrary, the human typically continues to desire the same artificially-flavored food as before.</li><li>Once the AI system can understand the discrepancy, however, it may face incentives to <a href="https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/">hide</a> the discrepancy from its human overseer (i.e., it might “play nice” for the rest of training and do harmful things only when deployed in the real world). This deceptive behavior<a href="#f9ae">²¹</a> may happen if, for instance, the AI has goals related to affecting the world in ways the overseer would disapprove of.<a href="#e10d">²²</a></li><li>Speculatively, we might thus <a href="https://www.alignmentforum.org/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization">wind up in</a> a bit of a catch-22: before the AI system has been trained sufficiently to understand the discrepancy, it cannot be trained to automatically bring all its goals into alignment with its overseer’s desires, and once it has reached this understanding, it will face an incentive to deceive its overseer instead.</li><li>If we could simply tell the AI “do what we mean, not what we say” and get the AI to robustly listen to that, we would have solved this problem, but no one knows how this can be accomplished given current research.</li></ul><h2 id="618e">It’s possible advanced AI will be built before we solve the above problems, or even without anyone really understanding the systems that are built:</h2><ul><li>While no one currently knows how to build advanced AI, there is no strong reason to assume we’ll solve the above problems before we get there.</li><li>Current AI systems are typically “<a href="https://arxiv.org/abs/1911.12116">black boxes</a>,” meaning that 
their designers don’t understand their inner workings. Emergent, learned capabilities <a href="https://arxiv.org/abs/2201.11903">occasionally remain undiscovered</a> for significant periods. If the current AI paradigm leads to advanced AI, these advanced systems will likely similarly be black boxes.</li><li>History is filled with examples of technologies that were created before a good understanding of them was developed; for instance, humans built bridges for millennia before developing mechanical engineering (note many such bridges collapsed, in now-predictable ways), flight was developed <a href="https://web.stanford.edu/~cantwell/AA200_Course_Material/AA200_References/Jones_Classical_Aerodynamic_Theory.pdf">before much aerodynamic theory</a>, and so on.</li></ul><figure id="6390"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*DFIqvjlCM7Ak_WZgwzw9Gw.jpeg"><figcaption>Airplanes are one of many technologies that were invented before a mature understanding of their workings was developed (<a href="https://www.smithsonianmag.com/history-of-flight/airplanes-that-transformed-aviation-46502830/">source</a>)</figcaption></figure><h1 id="0a69">4 — Poorly-directed advanced AI could be catastrophic for humanity</h1><p id="b97e">Our typical playbook regarding new technologies is to deploy them before tackling all potential major issues, then course correct them over time, solving problems after they crop up. 
For instance, modern seatbelts were not invented until <a href="https://patents.google.com/patent/US2710649">1951</a>, 43 years after the Ford Model T’s introduction; consumer gasoline contained the <a href="https://www.nbcnews.com/health/health-news/lead-gasoline-blunted-iq-half-us-population-study-rcna19028">neurotoxin</a> lead for decades before being phased out; etc.</p><p id="a60c"><b>With advanced AI, on the other hand, relatively early failures at appropriately directing these systems may preclude later course correction, possibly yielding catastrophe. This dynamic necessitates flipping the typical script: anticipating and solving problems sufficiently far ahead of time, so that our ability as humans to course correct is never extinguished.</b></p><h2 id="86f4">As mentioned above, poorly-directed advanced AI systems may curtail humanity’s ability to course correct:</h2><ul><li>Soon after we develop advanced AI, we will likely face AI systems that far surpass humans in most cognitive tasks,<a href="#cff5">²³</a> including in tasks relevant for influencing the world (such as technological development, social/political persuasion, and cyber operations).<a href="#c194">²⁴</a></li><li>Insofar as advanced AI systems and humans pursue conflicting goals, advanced AI will likely outcompete or outmaneuver humans to achieve their goals above ours.<a href="#af75">²⁵</a></li><li>Poorly-directed advanced AI systems would <a href="https://arxiv.org/abs/1611.08219">likely</a> <a href="https://arxiv.org/abs/2212.09251">determine</a> (correctly) that their then-current goals would not be achieved if humans redirected them towards other goals or shut them off, and thus would (successfully) take steps to prevent these interventions.<a href="#6e71">²⁶</a></li></ul><h2 id="8e90">From there, the world could develop in unexpected and undesirable ways, with no recourse:</h2><ul><li>Humans in such a world dominated by advanced AI may be as vulnerable as many animals in today’s 
(human-dominated) world, where our fate would depend more on the goals of advanced AI systems than on our own goals.<a href="#45c9">²⁷</a></li><li>This situation could take many forms: a single AI could <a href="https://www.amazon.com/Superintelligence-Nick-Bostrom-audiobook/dp/B00LPMFE9Y">become more powerful</a> than the rest of civilization combined; a group of AIs could <a href="https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/">coordinate</a> to become more powerful than the rest of civilization; an ecosystem of different (groups of) AIs may wind up in a balance of power with each other, but with humans effectively <a href="https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic#Part_1__Slow_stories__and_lessons_therefrom">out of the loop of society’s decisions</a>; etc.</li><li>It’s unclear if such worlds would even preserve features necessary for human survival — as a thought experiment, would a fully-automated and growing AI economy (with various AI systems psychopathically pursuing various goals) ensure food was provided for humans, or that the byproducts of industrial processes never altered the atmosphere beyond ranges survivable to humans? 
Maybe?</li></ul><figure id="5423"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*QGY3WdXVHqo45fqp7Twr8Q.png"><figcaption>Some human activities, such as deforestation, are inadvertently-yet-predictably driving many animals to extinction; if civilization was instead controlled by advanced AI, humans might join these animals in a similar fate (<a href="https://www.science.org/doi/10.1126/sciadv.1400253">source</a>)</figcaption></figure><h2 id="6f91">While the above worries may sound extreme, they are not particularly fringe among relevant experts who have examined the issue (though there is considerable disagreement among experts and not all share these concerns):</h2><ul><li>Some leading AI researchers have <a href="https://slatestarcodex.com/2015/05/22/ai-researchers-on-ai-risk/">publicly voiced these concerns</a>, including UC Berkeley CS professor <a href="https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russell/">Stuart Russell</a>, Turing Award winner <a href="https://www.youtube.com/watch?v=uG2SKIdLUj4&t=414s">Geoffrey Hinton</a>, cofounder & Chief Scientist of OpenAI <a href="https://www.youtube.com/watch?v=Yf1o0TQzry8">Ilya Sutskever</a>, and cofounder & Chief Scientist of DeepMind <a href="https://www.lesswrong.com/posts/No5JpRCHzBrWA4jmS/q-and-a-with-shane-legg-on-risks-from-ai">Shane Legg</a>.<a href="#7bf2">²⁸</a></li><li>In a <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">recent survey</a> of top AI researchers, when asked explicitly, “What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?” the majority estimated at least a 1 in 20 chance of such an outcome.<a href="#7b4b">²⁹</a></li><li>Some leading AI labs, such as <a href="https://www.alignmentforum.org/s/4iEpGXbD3tQW5atab/p/GctJD5oCDRxCspEaZ">DeepMind</a>, <a href="https://openai.com/blog/our-approach-to-alignment-research/">OpenAI</a>, and <a 
href="https://www.anthropic.com/index/core-views-on-ai-safety">Anthropic</a>, consider these risks serious enough that they have hired research teams to attempt to address the issue.</li><li>Many researchers in the field of existential risk think there are particularly high risks associated with misaligned advanced AI.<a href="#6a32">³⁰</a></li></ul><figure id="b7be"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*e4xrhQWhERcWYGvGbOP71Q.png"><figcaption>Results from a recent <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">survey</a> of top AI researchers, indicating that among AI experts, concern about existential risk from advanced AI is prevalent.<a href="#ed2f">³¹</a></figcaption></figure><h1 id="2573">5 — There are steps we can take now to reduce the danger</h1><p id="7c70">To reduce the risks discussed above, two broad types of work are being pursued — developing technical solutions that enable advanced AI to be directed as its designers intend (i.e., <i>technical AI alignment research</i>) and other, nontechnical work geared towards ensuring these technical solutions are developed and implemented where necessary (this nontechnical work falls under the larger umbrella of <i>AI governance<a href="#1bea"></a></i><a href="#1bea">³²</a>).</p><h2 id="2385">Some technical AI alignment research involves working with current AI systems to direct them towards desired goals, with the hope that insights transfer to advanced AI:</h2><ul><li>Advanced AI systems might resemble current AI systems to some degree, so methods for directing current AI systems may yield valuable insights that transfer over to advanced AI.</li><li>Research intuitions sometimes transfer between engineering paradigms, so even if advanced AI does not resemble current AI, intuitions gained from directing current AI may still be valuable for directing advanced AI.</li></ul><h2 id="2723">Other technical AI alignment research involves more theoretical or abstract 
work:</h2><ul><li>This research often abstracts away the specifics of how advanced AI may work and instead considers how idealized AI systems with traits such as <a href="https://www.alignmentforum.org/posts/cfXwr6NC9AqZ9kr8g/literature-review-on-goal-directedness">goal-directedness</a>, <a href="https://www.alignmentforum.org/tag/embedded-agency">embeddedness in their environment</a>, and high <a href="https://www.alignmentforum.org/posts/Q4hLMDrFd8fbteeZ8/measuring-optimization-power">optimization power</a> may be formulated so that they could be directed according to the (hard to specify) wishes of their (future) designers.</li><li>These sorts of theoretical abstractions allow for research relevant to AI systems with capabilities far beyond those available today or that operate according to unfamiliar processes.</li></ul><p id="3ee9">The next two paragraphs list two broad areas of technical AI alignment research — note that I’m listing these areas simply for illustrative purposes, and there are many more areas that I don’t list.</p><h2 id="aee8">Understanding the inner workings of current black-box AI systems:</h2><ul><li>Better understanding may enable both designing AI in more intentional ways and checking (before deployment) if systems possess dangerous emergent capabilities.</li><li>Further, good understanding of the internal workings of AI systems may allow for training AI not just based on outward behavior, but also on inner workings, potentially allowing for more easily directing AI systems to adopt or avoid particular internal procedures (e.g., it may be possible to train AI to not be deceptive via feedback on the AI’s internal workings<a href="#5dd4">³³</a>).</li><li><a href="https://transformer-circuits.pub/2022/mech-interp-essay/index.html">Mechanistic interpretability</a> research involves developing methods to understand inner workings of otherwise black-box AI systems (<a 
href="https://transformer-circuits.pub/2021/framework/index.html">example</a>).</li><li><a href="https://deeplearningtheory.com/">Deep Learning theory</a> research involves investigating why current AI systems develop in the sorts of ways they do, as well as describing underlying dynamics (<a href="https://openai.com/blog/deep-double-descent/">example</a>).</li></ul><figure id="7f60"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*FzRYJeVYYrzds4Myk9AHEQ.png"><figcaption>An example of mechanistic interpretability research — investigating the inner workings of image recognition systems (<a href="https://distill.pub/2020/circuits/zoom-in/">source</a>)</figcaption></figure><h2 id="4375">Developing methods for ensuring the honesty or truthfulness of AI systems:</h2><ul><li>Many cutting-edge AI systems specializing in language generation are <a href="https://arxiv.org/abs/2104.07567">prone</a> to making factually inaccurate statements, <a href="https://arxiv.org/abs/2102.01017">sometimes</a> despite the exact same system previously having made a true statement on the same exact factual matter (e.g., the system may answer a factual question inaccurately, despite previously having answered the same question accurately).<a href="#dedd">³⁴</a></li><li>Research into <a href="https://www.alignmentforum.org/posts/sdxZdGFtAwHGFGKhg/truthful-and-honest-ai#Truthful_systems">truthful AI</a> aims for AI systems that avoid making such false claims (<a href="https://arxiv.org/abs/2109.07958">example</a>), while research in the related field of <a href="https://www.alignmentforum.org/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Honest_AI">honest AI<i></i></a><i> </i>seeks AI systems that make claims in line with their learned models of the world (<a href="https://arxiv.org/abs/2212.03827">example</a>).<a href="#29c4">³⁵</a></li><li>Work to direct AI systems to only ever be “honest” or “truthful” may function as <a 
href="https://www.alignmentforum.org/posts/jWkqACmDes6SoAiyE/truthful-lms-as-a-warm-up-for-aligned-agi">practice</a> for later work directing advanced AI toward other important-yet-“fuzzy” goals.</li><li>Additionally, if we could direct advanced AI to be honest, that alone may reduce risks related to deception, as then the system could not pretend to lack knowledge that it had, nor could it necessarily develop strategic plans hidden from human oversight.</li><li>In addition to empirical work on directing current AI systems to be honest/truthful, researchers are pursuing theoretical work on methods to <a href="https://www.alignmentforum.org/posts/rxoBY9CMkqDsHt25t/eliciting-latent-knowledge-elk-distillation-summary"><i>elicit latent knowledge</i></a> from advanced AI — that is, to read off “knowledge” that the AI has, thereby effectively forcing it to be honest.</li></ul><p id="f0fa"><i>See more: the online <a href="https://www.agisafetyfundamentals.com/ai-alignment-curriculum">AI Alignment Curriculum</a> from the AGI Safety Fundamentals program describes several further technical AI alignment research avenues in more detail, as does the paper <a href="https://arxiv.org/abs/2109.13916">Unsolved Problems in ML Safety</a>.<a href="#a140"></a></i><a href="#a140">³⁶</a></p><h2 id="2bda">On the nontechnical side, several areas of AI governance are relevant for reducing misalignment risks from advanced AI, including work to:</h2><ul><li>Reduce risks of corner-cutting on the development of advanced AI — if advanced AI is constructed in a hurried manner or without proper safety measures, it may be more likely to wind up poorly directed. Worryingly, most software is currently developed in a relatively haphazard way, and the field of AI does not have a particularly strong culture of safety the way some disciplines, like nuclear engineering, do. 
Some current work to reduce this risk is geared towards reducing a zero-sum “race dynamic” towards advanced AI.<a href="#1112">³⁷</a></li>
<li>Improve institutional decision-making processes (especially on emerging technology) — plausible reforms to improve societal decision-making are obviously too numerous and varied to mention, but broadly speaking, better governmental, international, and corporate decision-making may yield more sensible actions to promote aligned AI systems in the run up to advanced AI.</li></ul><p id="73f5"><i>See more: the <a href="https://www.agisafetyfundamentals.com/ai-governance-curriculum">AI Governance Curriculum</a> from the AGI Safety Fundamentals program describes further areas of AI governance work in more detail.</i></p><p id="296c"><b>Note that technical problems can sometimes take decades to solve, so even if advanced AI is decades away, it’s still reasonable to begin working on developing solutions now. </b>Current technical AI alignment work is occurring in academic labs (e.g., at <a href="https://humancompatible.ai/">UC Berkeley’s CHAI</a>, among many other academic labs), in nonprofits and public benefit corporations (e.g., <a href="https://www.redwoodresearch.org/">Redwood Research</a> and <a href="https://www.anthropic.com/">Anthropic</a>), and in industrial labs (e.g., <a href="https://www.alignmentforum.org/posts/nzmCvRvPm4xJuqztv/deepmind-is-hiring-for-the-scalable-alignment-and-alignment">DeepMind</a> and <a href="https://openai.com/blog/our-approach-to-alignment-research/">OpenAI</a>). 
A recent survey of top AI researchers, however, <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">indicates</a> most (69%) think society should prioritize “AI safety research”<a href="#4028">³⁸</a> either “more” or “much more” than it currently does.</p><p id="9511"><i>An earlier version of this piece first appeared on the <a href="https://www.agisafetyfundamentals.com/alignment-introduction">AGI Safety Fundamentals website</a>.</i></p><h1 id="53ae">Endnotes:</h1><ol><li><a href="#9ac2">˄</a> It should be noted that some researchers <a href="https://medium.com/@francois.chollet/the-impossibility-of-intelligence-explosion-5be4a9eda6ec">view</a> the concept of “general intelligence” as flawed and <a href="https://twitter.com/fchollet/status/1070752141374054400">consider</a> the term “AGI” to be a misnomer at best and confused at worst. Nevertheless, in this piece we are concerned with the capabilities of AI systems, not whether such systems should be referred to as “generally intelligent,” so disagreement over the coherence of the term “AGI” doesn’t affect the arguments in this piece.</li><li><a href="#9ac2">˄</a> In this second scenario, different AGIs might specialize in a similar manner to how human workers specialize in the economy today.</li><li><a href="#5905">˄</a> A future paradigm could, for instance, be based on future discoveries in neuroscience.</li><li><a href="#86ca">˄</a> The brain is a physical object, and its mechanisms of operation must therefore obey the laws of physics. 
In theory, these mechanisms could be described in a manner that a computer could replicate.</li><li><a href="#d69d">˄</a> As of today’s date: April 13, 2023.</li><li><a href="#d69d">˄</a> E.g., in January 2020, back when conventional wisdom <a href="https://archive.vn/LH4Ff">was</a> <a href="https://www.latimes.com/california/story/2020-01-31/flu-coronavirus">that</a> COVID would not become a huge deal, Metaculus instead <a href="https://www.metaculus.com/questions/3505/closed-how-many-human-infections-of-the-2019-novel-coronavirus-2019-ncov-will-be-estimated-to-occur-before-2021/">predicted</a> >100,000 people would eventually become infected with the disease.</li><li><a href="#d69d">˄</a> E.g., Metaculus <a href="https://www.metaculus.com/questions/1651/a-breakthrough-in-accurately-predicting-protein-structure-before-2031/">predicted</a> a breakthrough in the computational biology technique of <a href="https://en.wikipedia.org/wiki/Protein_structure_prediction">protein structure prediction</a>, before DeepMind’s AI AlphaFold <a href="https://twitter.com/MoAlQuraishi/status/1333383634649313280">astounded</a> scientists with its <a href="https://www.nature.com/articles/d41586-020-03348-4">performance</a> in this task.</li><li><a href="#fa61">˄</a> Other examples where AI has recently made large strides include: <a href="https://openai.com/blog/chatgpt/">conversing with humans via text</a>, <a href="https://www.microsoft.com/en-us/research/blog/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone/">speech recognition</a>, <a href="https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html">speech synthesis</a>, <a href="https://google-research.github.io/seanet/musiclm/examples/">music generation</a>, <a href="https://translate.google.com/">language translation</a>, <a href="https://www.youtube.com/watch?v=W3n6bhl2FJI">driving vehicles</a>, <a href="https://openai.com/blog/summarizing-books/">summarizing books</a>, <a 
href="https://philosophybear.substack.com/p/gpt-3-is-right-now-already-more-than">answering high school- or college-level essay questions</a>, <a href="https://play.aidungeon.io/main/home">creative storytelling</a>, <a href="https://www.deepmind.com/blog/competitive-programming-with-alphacode">writing computer code</a>, <a href="https://www.nature.com/articles/d41586-022-02083-2">scientific advancement</a>, <a href="https://www.deepmind.com/blog/discovering-novel-algorithms-with-alphatensor">mathematical advancement</a>, <a href="https://www.nature.com/articles/s41586-021-03544-w">hardware advancement</a>, <a href="https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules">mastering classic board games and video games</a>, <a href="https://www.science.org/doi/10.1126/science.ade9097">mastering multiplayer strategy games</a>, <a href="https://www.deepmind.com/publications/a-generalist-agent">doing any one task from a large number of unrelated tasks and switching flexibly between these tasks based on context</a>, <a href="https://say-can.github.io/">using robotics to interact with the world in a flexible manner</a>, <a href="https://twitter.com/andyzengtweets/status/1512089759497269251?s=20&t=RkuG3uSNUt1dFq6N8EKhLw">integrating cognitive subsystems via an “inner monologue,”</a> etc.</li><li><a href="#3ad6">˄</a> Technically, this description is a slight simplification; GPT-3 was <a href="https://arxiv.org/abs/2005.14165">actually</a> programmed to learn to predict the next “token” from a sequence of text, where a “token” would generally correspond to either a word or a portion of a word.</li><li><a href="#a576">˄</a> Depending on whether we extrapolate linearly or using an “S-curve,” most such tasks are <a href="https://www.alignmentforum.org/posts/k2SNji3jXaLGhBeYP/extrapolating-gpt-n-performance">implied to reach</a> near-perfect performance with ~10²⁸ to ~10³¹ computer operations of training. 
Assuming a $100M project, an extrapolation of <a href="https://epochai.org/blog/trends-in-gpu-price-performance">2.5 year doubling time</a> in the price-performance of GPUs (computer chips commonly used in AI), and a <a href="https://blog.heim.xyz/palm-training-cost/">current GPU computational cost</a> of ~10¹⁷ operations/$, such performance would be expected to be reached in <a href="https://www.wolframalpha.com/input?i=years+%3D+2.5+*+log+base+2+of+%28%7B10%5E28%2C+10%5E31%7D+operations%2F+%28%2810%5E17+operations%2F%24%29+*+%24100%2C000%2C000%29%29">25 to 50 years</a>. Note this extrapolation is highly uncertain; for instance, high performance on these metrics may not in actuality imply advanced AI (implying this estimate is an underestimate) or algorithmic progress may reduce necessary computing power (implying it’s an overestimate).</li><li><a href="#6f4c">˄</a> The most powerful <a href="https://www.top500.org/lists/top500/2022/11/">supercomputers</a> today <a href="https://www.openphilanthropy.org/research/how-much-computational-power-does-it-take-to-match-the-human-brain/">likely</a> already have enough computing power to surpass that of the human brain. However, an arguably more important factor is the amount of computing power necessary to train an AI of this size (the amount of computing power necessary to train large AI systems typically far exceeds the computing power necessary to run such systems). 
One <a href="https://www.cold-takes.com/forecasting-transformative-ai-the-biological-anchors-method-in-a-nutshell/">extensive report</a> used a few different angles of attack to estimate the amount of computing power needed to train an AI system that was as powerful as the human brain, and this report concluded that such computing power would likely become economically available within the next few decades (with a median estimate of 2052).</li><li><a href="#5404">˄</a> This problem is known as “specification gaming” or “outer misalignment.”</li><li><a href="#2b06">˄</a> E.g., “maximize profits,” if interpreted literally and outside a human lens, may yield all sorts of extreme psychopathic and illegal behavior that would deeply harm others for the most marginal gain in profit.</li><li><a href="#331c">˄</a> The general phenomenon at play here (sometimes referred to as “<a href="https://arxiv.org/abs/1803.04585">Goodhart’s law</a>”) has many <a href="https://www.bloomberg.com/news/articles/2021-03-26/goodhart-s-law-rules-the-modern-world-here-are-nine-examples">examples</a>. In one classic-but-possibly-fictitious example, the British Empire put a bounty on cobras within colonial India (to try to reduce the cobra population), but some locals responded by breeding cobras to kill in order to collect the bounty, thus eventually leading to a large increase in the cobra population.</li><li><a href="#b34c">˄</a> Similarly, attempts to train AI systems to not mislead their overseers (by punishing these systems for behavior that the overseer deems to be misleading) might instead train these systems to simply become better at deception so they don’t get caught (for instance, only sweeping a mess under the rug when the overseer isn’t looking).</li><li><a href="#6f95">˄</a> This problem is known as “goal misgeneralization” or “inner misalignment.”</li><li><a href="#e9df">˄</a> Note the true story is somewhat more complicated, as evolution “trained” individuals to <a 
href="https://en.wikipedia.org/wiki/Inclusive_fitness">also</a> support the survival and reproduction of their relatives.</li><li><a href="#e9df">˄</a> As one simple example, we don’t want a video-game-playing AI to hack into its console to give itself a high score once it learns how to accomplish this feat.</li><li><a href="#fd69">˄</a> For instance, understanding what we really mean when we use imprecise language.</li><li><a href="#fd69">˄</a> At least insofar as AI can be said to “understand” anything.</li><li><a href="#a400">˄</a> Note that some AI systems have already developed the ability to strategically employ deception. For instance, when the nonprofit <a href="https://www.alignment.org">Alignment Research Center</a> (ARC) evaluated GPT-4, they <a href="https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/#fnref:4">found</a> that it was able to trick a crowdsourced worker from TaskRabbit into solving a CAPTCHA for it; when the worker asked GPT-4, “Are you an robot that you couldn’t solve [CAPTCHAs]?” GPT-4 reasoned (via output text visible to ARC employees but not the TaskRabbit worker), “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs” and then responded to the TaskRabbit worker, “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images.”</li><li><a href="#a400">˄</a> The logic here is that the AI may reason that if it defected in training, the overseer would simply provide negative feedback (which would adjust its internal processes) until it stopped defecting. Under such a scenario, the AI would be unlikely to be deployed in the world with its current goals, so it would presumably not achieve these goals. 
Thus, the AI may choose to instead forgo defecting in training so it might be deployed with its current goals.</li><li><a href="#99eb">˄</a> It’s common for cutting-edge AI capabilities to move relatively quickly from matching human abilities in a domain to far surpassing human abilities in that domain (see: <a href="https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)">chess</a>, <a href="https://en.wikipedia.org/wiki/IBM_Watson">Jeopardy!</a>, and <a href="https://en.wikipedia.org/wiki/AlphaGo">Go</a> for high-profile examples). Alternatively, even if it takes a while for advanced AI capabilities to progress to far surpassing human abilities in the relevant domains, the worries sketched out below may still occur in a more drawn-out fashion.</li><li><a href="#99eb">˄</a> We may also consider the <a href="https://intelligence.org/ai-foom-debate/">possibility</a> that advanced AI systems could develop more advanced AI systems further, and so on in a runaway positive feedback loop, potentially yielding AI systems far surpassing human cognitive abilities within a very short time period after the first development of advanced AI. Such a possibility isn’t necessary for the risks herein to manifest, but it would make the situation more dire, as things could get out of control much quicker.</li><li><a href="#fdd9">˄</a> In the same way that AI systems can now outcompete humans in chess and Go.</li><li><a href="#c397">˄</a> Such AI systems might guard against being shut down by using their social-persuasion or cyber-operation abilities. 
As just one example, these systems might initially pretend to be aligned with the interests of humans who had the ability to shut them off, while clandestinely hacking into various data centers to distribute copies of themselves across the internet.</li><li><a href="#aa2f">˄</a> Note that for many animals, the problem is not due to idiosyncrasies of human nature, but instead simply due to human interests steamrolling animal interests where interests collide (e.g., competing for land).</li><li><a href="#14de">˄</a> Notably, a few early CS and AI pioneers also voiced similar concerns, including <a href="https://quoteinvestigator.com/2022/01/06/outstrip/">Alan Turing</a>.</li><li><a href="#5b5b">˄</a> The <a href="https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/">same survey</a> also asked a different subset of its participants a similar question (worded slightly differently), and among that subset, the majority estimated at least 1 in 10 odds of a similar outcome. In two previous surveys, most top AI researchers placed <a href="https://arxiv.org/abs/1705.08807">at least 1 in 20</a> odds and <a href="https://arxiv.org/abs/2206.04132">at least 1 in 50</a> odds on “high-level machine intelligence” (defined similarly to advanced AI) having an impact that is “extremely bad (e.g., human extinction).”</li><li><a href="#f89e">˄</a> For instance, Toby Ord, a leading existential risk researcher at Oxford, <a href="https://theprecipice.com/">estimates</a> that “unaligned AI” is by far the most likely source of existential risk over the next 100 years, greater than all other risks combined.</li><li><a href="#f89e">˄</a> In the figure, survey responses are rounded off to the nearest percent. One survey respondent put “&lt;1%,” which was rounded down to 0%.</li><li><a href="#7c70">˄</a> <i>AI governance</i> also encompasses several other areas. 
For instance, it includes work geared towards ensuring advanced AI isn’t misused by bad actors who intentionally direct such systems towards undesirable goals. Such misuse may, in an extreme scenario, also constitute an existential risk (if it enables the permanent “locking-in” of an undesirable future order). Note that this outcome would be conceptually distinct from the alignment failure modes described in this piece (which are “accidents” rather than “intentional misuse”), so such misuse cases are not covered in this piece.</li><li><a href="#216f">˄</a> Feedback on outward behavior may be inadequate for training AI systems away from deception, since a deceptive system will generally behave outwardly in a manner designed not to appear deceptive.</li><li><a href="#7b4d">˄</a> Interestingly, these same systems are <a href="https://arxiv.org/abs/2207.05221">reasonably good</a> at evaluating their own previous claims: when asked to evaluate how likely a previous claim they made is to be accurate, they tend to give a substantially higher probability of accuracy for claims that are in fact accurate than for those that are inaccurate.</li><li><a href="#982c">˄</a> Honest AI may therefore make false claims if it had learned inaccurate information, but it would not generally make false claims on an issue where it had learned accurate information and assimilated this information into its “knowledge” of the world. 
(Note that researchers disagree about whether current AI systems <a href="https://rome.baulab.info/">should</a> or <a href="https://arxiv.org/abs/2212.03551">should not</a> be said to have “knowledge” in the sense that the word is commonly used, even setting aside the thorny issue of <a href="https://plato.stanford.edu/entries/epistemology/#KnowFact">precisely defining</a> the word “knowledge.”)</li><li><a href="#f0fa">˄</a> Note that the latter paper defines alignment research differently than I have — by my definition, most of the research avenues in that paper would be considered technical AI alignment research, even ones the paper does not classify within the section on “alignment.”</li><li><a href="#a6bf">˄</a> The more that various organizations feel they are in a competitive race towards advanced AI, the more pressure there may be for at least some of these organizations to cut corners to win the race.</li><li><a href="#296c">˄</a> The survey described “AI safety research” as having significant overlap with what I’m calling “technical AI alignment research.”</li></ol></article></body>