A case for AI safety

Smarter-than-human AI may come soon and could lead to human extinction

TL;DR: Companies are racing to build smarter-than-human AI. Experts think they may succeed in the next decade. But rather than building AI, they’re growing it — and nobody knows how the resulting systems work. Experts argue over whether we’ll lose control of them, and whether this will lead to our demise. And although some decision-makers are talking about extinction risk, humanity does not have a plan.

Human-Level AI Is Approaching Rapidly

Something unprecedented is happening in the field of AI, yet much of society seems to have barely noticed.

When Deep Blue beat Kasparov in 1997, and when AlphaGo beat Lee Sedol twenty years later, the world watched in amazement. Today, news of AI outperforming humans in yet another field is met with a collective shrug. Like the proverbial frog in slowly heating water, society has become accustomed to watching as AI zooms, again and again, past milestones that would have made headlines just a few years prior.

Today's AI doesn't just master isolated tasks—it dominates entire categories of human skills at once. Object recognition fell in 2016. The SAT and the bar exam fell in 2022 and 2023 respectively. Even the Turing Test, which stood as a benchmark for “human-level intelligence” for decades, was quietly rendered obsolete by ChatGPT in 2023. Our sense of what counts as impressive keeps shifting.

The pace is relentless. Breakthroughs that once required decades now happen quarterly. Research labs scramble to design new, harder benchmarks, only to watch as the latest AI models master them within months.

The companies at the forefront – OpenAI, Anthropic, Google DeepMind, and others – are spending billions in a high-stakes race to build Artificial General Intelligence (AGI): systems that match or exceed human cognitive capabilities across virtually all domains. Not narrow AI that excels at one task, such as playing chess, but AI that can learn anything a human can learn, and do it better.

People involved with these companies – investors, researchers and CEOs – don't see this as science fiction. They're betting fortunes on their ability to build AGI within the next decade. Read that again: the leaders in this field, who have access to bleeding edge, unpublished systems, believe human-level AI could arrive before you next need to renew your passport. This could happen next decade, or it could happen next year.

What happens if they succeed? AGI wouldn't just match human capabilities — it would likely exceed them in countless ways. Such systems could think thousands of times faster than biological brains and never tire. They could be perfectly replicated from one instance to millions. They could access and process vast amounts of publicly available information in seconds, integrating knowledge from books, papers, and databases far faster than any human could. They could automate most intellectual work — including AI research itself.

That last point is crucial — AGI could accelerate AI development even further, leading to superintelligent AI. Don’t think of superintelligent AI as a single smart person, think instead of the entire team of geniuses from the Manhattan Project, working at 1000x speed. The effects of superintelligent AI on society would be so profound as to be hard to predict.1

This isn't distant speculation. Independent AI researchers see these predictions as worryingly plausible, and the companies building these systems openly discuss what happens after AGI, planning for a world where digital minds surpass their creators. Yet our social systems, our regulations, and our collective understanding remain focused on current AI capabilities, rather than on the transformative systems that may be developed in the coming years.

The consequences — whether utopian or catastrophic — will reshape every aspect of human civilization. If things go well, these advances could help solve some of humanity's greatest challenges: disease, poverty, climate change, and more. But if these systems don't reliably pursue the goals we intend, if we fail to ensure they're aligned with human values and interests... it’d be game over for humanity.

The water is warming rapidly. Time to pay attention.

Modern AI Systems Remain Fundamentally Opaque

With AGI potentially arriving within a decade, we face a critical difficulty: we don’t know how to ensure superhuman systems shape the world in ways we intend. This is made harder by the fact that we don’t actually know how our most advanced AI systems work — and that makes ensuring their safety extremely difficult.

Modern AI isn’t built like regular software — it’s grown through training on massive datasets. Even though we have complete access to an AI's parameters — the billions of numbers that determine its behavior — we're not even close to understanding how these parameters work together to produce the AI's outputs.

Caption: The internal representation of a neural network identifying digits (source)

Imagine trying to understand how a human brain forms a thought by examining individual neurons, or trying to understand macroeconomics by looking at individual transactions — even if you can see all the parts and connections, the ways in which the low-level interactions add up to the higher-level behavior are so incredibly complex as to be inscrutable.

This lack of understanding is a serious problem when we try to make AI systems that reliably do what we want. Our current approach is essentially to provide millions of examples of the desired behavior, tweak the AI's parameters until its behavior looks right, and then hope the AI generalizes correctly to new situations.

The opacity of modern AI systems is particularly concerning because we're explicitly trying to build goal-directed AI. When we talk about an AI system having a "goal", we mean it consistently acts to push toward a particular outcome across different situations. For instance, a chess AI's 'goal' is to win because all its moves push toward checkmate, and it will adapt to respond to the moves of its adversary.

A chess AI is limited in the things it can do: it can read the game board and output legal moves, but it can't access its opponent’s match history or threateningly stare them down. But as AI systems become more capable and given more access to the world, they can take much broader sorts of actions in the world, and can successfully pursue more complicated goals. And of course companies are rushing to build these broader goal-directed systems because these systems are expected to be incredibly useful for solving real-world problems, and thus profitable.

But here's where the danger emerges: there's often a gap between the goals we intend to give AI systems and the goals they actually learn.

Here's a typical example: researchers trained an AI to achieve a high score in the video game Coast Runners. They expected the AI to learn to navigate the course to win the race. Instead, it discovered an unexpected strategy: repeatedly collecting power-ups in a loop rather than completing the course. It optimized exactly what it was trained for (score) but not what the researchers actually wanted (winning races).

Caption: The Coast Runners boat looping to pick up power-ups.

Or consider how OpenAI trained ChatGPT to avoid unsavoury responses, like teaching users how to make weapons or role-playing erotic encounters. Days after release, users circumvented these safeguards through creative prompting. These examples aren't unusually worrying, of course, but they illustrate a profound challenge: ensuring an AI reliably acts according to our intended goals and values — what researchers call the 'alignment problem.' Systems that fail to do so are called “misaligned”.

Despite significant research efforts, this problem remains stubbornly difficult. And as AI capabilities grow, the stakes of getting it wrong become increasingly severe.

The Stakes Rise with System Capability

The consequences of misalignment in today’s AIs are usually manageable — an incorrect recommendation or an inappropriate response. We can mostly shrug these off. But this won't remain true for long.

Once AI systems are superhuman and act in situations that are very different from the situations they were trained in, the stakes change dramatically. Subtle misalignments between what we intended and what they learned become magnified, potentially catastrophically so.

A process that strongly optimizes for a target will often result in large side-effects, which can be quite harmful. This isn't theoretical — we see it all around us. Social media wasn't built to damage teenage mental health; it was built to maximize engagement. The harm wasn't the goal; it was collateral damage from relentless optimization for something else entirely. Similarly, education systems that optimize for standardized test scores often sacrifice the joy of learning in the process. The stronger the optimization, the more it pries apart the gaps between the target and what we want.

Human values are intricate and nuanced. Almost every goal that an AI could optimize for would, when pursued to the limit, fundamentally conflict with what we care about in some important way. Even AI systems designed with seemingly beneficial goals can be dangerous if they miss crucial dimensions of what matters to us.

Consider a hypothetical superhuman system programmed to optimize for human happiness. This sounds benign — even desirable. But if this AI doesn't understand our need for autonomy, challenge, and meaning, it might create a world where humans are superficially content but lack meaningful choices. It could conclude that the most efficient path to universal happiness is through chemical manipulation of our brains. We'd end up as comfortable prisoners in a technological version of Huxley's "Brave New World" — docile, perpetually satisfied, and utterly diminished as human beings. And that’s if we’re lucky; in reality, our current technical understanding of AI is too limited to reliably impart even such simplified goals, making the actual outcome likely far worse.

As AI capabilities grow, the alignment challenge becomes not just important but potentially existential. Almost any goal, when pursued with superintelligent capabilities, naturally gravitates toward seeking greater control and more resources. This isn't because the AI would be malicious — it would simply be pursuing its objectives with ruthless efficiency. The next section explains why this power-seeking behavior emerges and what it could mean for humanity.

Misaligned AIs May Try To “Take Over”

To understand why superintelligent AI might seek control, consider a simple (though contrived) thought experiment: imagine you create a superintelligent AI with the unique goal of making sure you never run out of coffee. This goal sounds harmless, but a superintelligent system could pursue it with singular, unwavering focus.

It might start reasonably — ordering coffee supplies, monitoring your consumption patterns — but what if delivery services become unreliable? It should secure its own transportation. What if coffee bean supplies are threatened by a changing climate? It should acquire land for cultivation and control weather systems. What if humans try to shut it down or change its goal? That would prevent it from fulfilling its mission, so it must prevent this interference.

Before long, ensuring a reliable coffee supply would lead it to seek control over resources and infrastructure, and to set itself up to not be disturbed in its plans. This pattern, where diverse goals converge on similar power-seeking behaviors, is one of the things that makes misaligned superintelligence particularly dangerous.

Realistically, the first superintelligent AI we deploy will not be a barista. But other, more realistic goals could lead to similar outcomes. For instance, imagine deploying a misaligned superintelligent AI system to "reduce global conflict". It might begin by mediating international disputes and suggesting novel diplomatic solutions. Remember, this AI is a better strategist than any human ever was, and can craft persuasive arguments for every party. It can also craft and execute massive propaganda campaigns (let's call them 'information initiatives') to ensure the public is on board. But to ensure lasting peace, it might gradually gain control over military systems (to prevent accidental escalation), financial networks (to enforce economic incentives for cooperation), and communication infrastructure (to detect and prevent brewing conflicts). Each step would seem beneficial in isolation, but together they would give it unprecedented control over human society.

At this point, we had better hope that it is fully aligned with our values. If it is not, it could decide to pump us full of heroin (which would succeed at “reducing global conflict") and there would be nothing we could do to stop it. And the most likely scenario if we deploy a misaligned superintelligence is not Brave New World, it is human extinction.

AI Takeover Leads to Human Extinction

Why would a superintelligence pursuing its own goals harm us after taking over?

The most straightforward reason for an AI to dispose of humans that might oppose its plans is that, well, humans might oppose its plans. It’s that simple. This behaviour does not need to come from malice but from relentless optimization: we humans have eliminated “pest” species that were threatening our crops or livestock because they were in the way of our plans.

One might argue that humans have proven remarkably adaptable and resilient throughout history. After all, we've survived ice ages, pandemics, and numerous other catastrophes. But this historical resilience offers little comfort when facing a superintelligent adversary, since these catastrophes were not actively optimizing against our survival.

It’s hard for us to guess what a superintelligent AI that wants to eliminate humanity would do. We can speculate about possibilities — e.g., engineering and disseminating a deadly pathogen — but it could also do something we never would have thought of, in the same way the “pests” we eliminated were unable to fathom the use of pesticides as a threat to their position.

But our demise does not need to come from direct confrontation with the AI; it could also come from simply competing with it for resources. Consider how Homo sapiens came to dominate Earth: we didn't intentionally drive other hominids to extinction — we simply pursued our goals more effectively, gradually taking over the best territories and resources until they had nowhere left to thrive. We've since done the same to countless other species. Similarly, in a world with superhuman AIs that don't fully share our values, we might find ourselves increasingly pushed to the margins as these systems reshape the world to serve their objectives. Just as human activity has driven countless species to extinction, a superintelligent AI pursuing its goals might simply make Earth uninhabitable for humans, as well as for all other life, as a side effect.

Against a superintelligent AI pursuing its own objective, we would be like a novice facing a chess grandmaster — the exact path to defeat matters less than its inevitability.

AI Scientists Are Sounding the Alarm

AI takeover leading to human extinction might sound like science fiction, but many of the world's leading AI experts are genuinely concerned about it.

Among academic researchers who have spent decades advancing AI, some have shifted from making AI more capable to warning about its risks. Look at Geoffrey Hinton and Yoshua Bengio, both Turing Award winners widely known as "godfathers of AI". These aren't fringe voices — they're among the most cited AI researchers in the world.

  • Hinton left his position at Google in 2023 specifically to speak freely about these risks. When awarded the 2024 Physics Nobel Prize, he used this platform to warn about AI dangers.

  • Bengio has redirected his research focus toward ensuring AI systems remain aligned with human values and beneficial to society.

These concerns are also shared by those actively working to build AGI in industry:

  • Sam Altman (OpenAI CEO) has stated that if things go poorly with advanced AI, it could be "lights out for all of us".

  • Other prominent figures voicing concerns include Ilya Sutskever (OpenAI co-founder), Dario Amodei (Anthropic founder), Demis Hassabis and Shane Legg (Google DeepMind founders), and Elon Musk (xAI founder).

Many of these experts signed a statement declaring that "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war".

As with any complex issue, there are dissenting voices. Some believe that AGI is many decades away, while others think that we can control AGI like any other technology. However, it's no longer reasonable to dismiss extinction risk from AI as a fringe concern when many pioneers of the technology are raising alarms.

But despite this growing consensus among experts, humanity remains largely unprepared to address this risk.

The Dangerous Race to AI Supremacy

Solving AI alignment before AGI is deployed is critical. But if AI companies succeed in building AGI in the next decade, this does not leave us a lot of time to solve alignment.

And while companies pour billions into making AI more capable, safety research remains comparatively underfunded and understaffed. We're effectively prioritizing making AI more powerful over making it safe.

Companies face intense pressure to deploy their AI products quickly — waiting too long could mean falling behind competitors or losing market advantage. Moreover, AI companies continue pushing for increasingly autonomous, goal-directed systems because such agents are expected to create far greater economic value than non-agentic systems. But these agents pose significantly greater alignment challenges: not only are they inherently harder to align because of their expanded action-space, but their failures are also much more dangerous since they can autonomously take harmful actions without human oversight. This creates a dangerous situation where AI systems that are both very capable and increasingly autonomous might be deployed without being thoroughly tested.

Many employees at the companies building these AIs think that everybody should slow down and coordinate, but there are currently no mechanisms that enable this, so they keep on racing.

A similar dynamic may emerge between nations racing to build AGI first. Developing ever more powerful AI could confer massive economic, military benefits, which would be a package deal with major societal disruption. Shortly after, someone develops a superintelligence that could outfight all of humanity. If misaligned, such an ASI would threaten everyone, including its creators, regardless of who else possesses the technology. Ironically, while China has signalled reluctance towards developing AGI, a common narrative in the US is that competition is inevitable, potentially creating a self-fulfilling prophecy.

So, what should we do?

A clear path forward remains elusive. There is currently no widely accepted plan for making AI go well. Different stakeholders have conflicting visions of how to proceed safely. There are plans to help govern AI that might reconcile these visions, but they are not being seriously considered in the halls of power. In the absence of a consensus, a growing group of people see slowing down or stopping the development of AI as a necessary first step, although we lack the capacity to enforce a pause across companies and countries

This lack of coordination is dangerous — we can't afford to have different groups thwarting each other's attempts to take action when the stakes are this high. Creating an effective plan requires first building a shared understanding of the challenges we face. Only with a broader understanding of these risks can we build the collective will to demand better safeguards and redirect resources toward solving these fundamental safety challenges.

This website aims to inform people and cut through the confusion that exists around these topics. You can read our more detailed explanation to learn more, or if you already have a good understanding, you can visit our How Can I Help section.


  1. While we distinguish between AGI and superintelligent AI conceptually, many risk scenarios involve a rapid transition between these capabilities. The exact threshold where critical risks emerge may not cleanly align with these definitions. ↩︎



AISafety.info

We’re a global team of specialists and volunteers from various backgrounds who want to ensure that the effects of future AI are beneficial rather than catastrophic.

© AISafety.info, 2022—1970

Aisafety.info is an Ashgro Inc Project. Ashgro Inc (EIN: 88-4232889) is a 501(c)(3) Public Charity incorporated in Delaware.