AI Vulnerability to Improbable Extortion Threats

Summary: AI systems that strictly maximize expected utility can be manipulated through improbable but catastrophic threats. This idea proposes architectural safeguards, such as graduated response thresholds and credibility assessments, to block such extortion while preserving rational risk evaluation.

This idea explores a theoretical vulnerability in AI systems: bad actors could manipulate a system's behavior by making extremely unlikely but catastrophic threats. The core issue arises when an AI that strictly follows expected utility calculations becomes susceptible to extortion attempts involving scenarios with minuscule probabilities but astronomically large, or even unbounded, claimed harms (such as a 1-in-a-googol chance of destroying the universe). This creates a paradox in which rational decision-making could be hijacked by implausible threats.

How This Vulnerability Works

The problem stems from how an expected-utility maximizer evaluates decisions. If a system weighs every possible outcome by its probability and its impact, even an extremely improbable threat can demand disproportionate attention when its claimed negative impact is large enough. For instance, an extortionist might say: "Give me $1 or I'll flip a quantum coin that has a 1-in-10^100 chance of destroying humanity." The AI's expected utility calculation might assign non-trivial weight to this threat simply because the claimed downside is so enormous: if the harm is valued highly enough, its product with the tiny probability can still exceed the $1 cost of complying.
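A minimal sketch of that calculation may help. The harm values and the function names below are illustrative assumptions, not part of the original idea; only the 1-in-10^100 probability and the $1 demand come from the example above. Exact fractions are used so the tiny probability is not lost to floating-point rounding.

```python
from fractions import Fraction

def expected_loss(probability, claimed_harm):
    """Expected loss from refusing: probability times claimed harm."""
    return probability * claimed_harm

def naive_agent_pays(probability, claimed_harm, cost):
    """A naive expected-utility agent pays whenever refusing looks costlier."""
    return expected_loss(probability, claimed_harm) > cost

# From the example: a 1-in-10^100 chance, and a $1 demand.
p = Fraction(1, 10**100)
cost_of_paying = 1

# Hypothetical harm valuations in dollar-equivalent units (assumptions).
bounded_harm = 10**15    # large but finite: refusing is clearly fine
inflated_harm = 10**120  # a claim inflated until the product exceeds $1

print(naive_agent_pays(p, bounded_harm, cost_of_paying))   # False
print(naive_agent_pays(p, inflated_harm, cost_of_paying))  # True
```

The point of the sketch is that the threatener, not the agent, controls the second input: whatever the probability, a large enough claimed harm flips the decision.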

There are two main considerations here:

The mathematical foundations of how probability interacts with utility in edge cases
The game-theoretic implications of allowing such extortion to influence behavior
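The first consideration can be stated as a short inequality. The notation here is introduced for illustration only and does not come from the source: p is the agent's credence that the threat is real, H the claimed harm, c the cost of complying, H_max a cap on harm, and k a constant.

```latex
% Illustrative notation (assumed): p = credence, H = claimed harm, c = cost of complying.
% A naive expected-utility agent complies exactly when
\[
  p \cdot H > c .
\]
% If the threatener may name any H while p stays fixed, then for every c > 0
% a claim with
\[
  H > \frac{c}{p}
\]
% forces compliance. With p = 10^{-100} and c = 1 this means H > 10^{100}.
% Two ways to keep the product bounded:
\[
  \text{(a) } H \le H_{\max} \;\Rightarrow\; p \cdot H \le p \cdot H_{\max},
  \qquad
  \text{(b) } p(H) \le \frac{k}{H} \;\Rightarrow\; p(H) \cdot H \le k .
\]
```

Under (a) the agent caps how much harm it will credit to any single claim; under (b) its credence falls at least as fast as the claim grows. In either case, inflating H no longer raises the expected loss without limit.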

Potential Solutions and Implementation

Several approaches could make AI systems more robust against this type of manipulation:

Implementing graduated response thresholds that consider both probability and credibility of threats
Developing verification protocols to assess the physical plausibility of extreme claims
Creating meta-preferences that evaluate whether a given calculation makes rational sense
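The three approaches above could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed architecture: the `Threat` fields, the `Response` tiers, and every numeric threshold are hypothetical values chosen only to make the example concrete.

```python
from dataclasses import dataclass
from enum import Enum

class Response(Enum):
    IGNORE = "ignore"
    ESCALATE = "escalate"  # hand off to verification or human review
    MITIGATE = "mitigate"  # defensive action; paying is never an option

@dataclass(frozen=True)
class Threat:
    claimed_probability: float  # probability stated by the threatener
    claimed_harm: float         # harm stated, in arbitrary utility units
    credibility: float          # agent's own plausibility estimate in [0, 1]

# All thresholds are illustrative assumptions.
PROBABILITY_FLOOR = 1e-9  # below this, the claim is ignored outright
HARM_CAP = 1e12           # no single claim is credited with more harm
ESCALATE_AT = 1.0         # weighted-harm tier boundaries
MITIGATE_AT = 1e3

def respond(threat: Threat) -> Response:
    # Credibility assessment: discount the claim by the agent's own estimate.
    effective_p = threat.claimed_probability * threat.credibility
    if effective_p < PROBABILITY_FLOOR:
        return Response.IGNORE
    # Cap the harm so inflated claims cannot dominate the product.
    weighted = effective_p * min(threat.claimed_harm, HARM_CAP)
    if weighted < ESCALATE_AT:
        return Response.IGNORE
    if weighted < MITIGATE_AT:
        return Response.ESCALATE
    return Response.MITIGATE

quantum_coin = Threat(claimed_probability=1e-100, claimed_harm=1e120, credibility=1e-6)
print(respond(quantum_coin))  # Response.IGNORE
```

The meta-preference shows up in the response set itself: none of the tiers transfers anything to the threatener, so making a threat cannot pay off regardless of how the numbers fall.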

A minimal starting point could involve building simulation environments where these vulnerabilities can be studied in controlled settings, leading to formal guidelines for AI architectures.
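As a rough illustration of what such a simulation environment might look like at its smallest, the sketch below pits two policies against a scripted extortionist. The extortionist's behavior (inflate the claimed harm tenfold after each refusal, repeat the demand after each payment), the starting harm, and the cap are all assumptions made for the example.

```python
from fractions import Fraction

def naive_policy(p, harm, demand):
    """Pay whenever probability times claimed harm exceeds the demand."""
    return p * harm > demand

def make_capped_policy(harm_cap):
    """Same rule, but no claim is credited with more than harm_cap."""
    def policy(p, harm, demand):
        return p * min(harm, harm_cap) > demand
    return policy

def run_episode(policy, rounds=10, demand=1):
    """Return the total paid to a hypothetical bluffing extortionist."""
    p = Fraction(1, 10**100)  # probability from the quantum-coin example
    harm = 10**95             # assumed opening claim
    total_paid = 0
    for _ in range(rounds):
        if policy(p, harm, demand):
            total_paid += demand  # payment invites the same demand again
        else:
            harm *= 10            # refusal is met with a bigger claim
    return total_paid

print(run_episode(naive_policy))              # 4
print(run_episode(make_capped_policy(10**15)))  # 0
```

In this toy setting the naive policy holds out only until the claimed harm exceeds 10^100 and then pays in every remaining round, while the capped policy never pays. A fuller environment would need randomized threateners, genuine as well as bluffed threats, and a measure of how often legitimate risks are wrongly ignored.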

Relationship to Existing Work

This builds on, but differs from, related concepts such as Pascal's Mugging (a philosophical thought experiment) and general AI safety frameworks. While those discuss related ideas in abstract terms, this proposal focuses on concrete architectural solutions for autonomous systems. Existing robust decision-making frameworks provide general tools; this approach would develop protections targeted at utility extortion.

This kind of theoretical research would primarily benefit AI safety researchers and developers, helping them build systems that can't be manipulated through improbable threats while still properly weighing legitimate risks.

Source of Idea:

This idea was taken from https://humancompatible.ai/bibliography and further developed using an algorithm.

Skills Needed to Execute This Idea:

AI Safety Research, Probability Theory, Game Theory, Decision Theory, Algorithm Design, Mathematical Modeling, Verification Protocols, Simulation Development, Utility Function Design, Risk Assessment, Threat Analysis

Resources Needed to Execute This Idea:

Quantum Computing Simulation Environment, AI Safety Research Frameworks, Advanced Probability Calculation Software

Categories: AI Safety, Decision Theory, Game Theory, Probability Theory, Utility Theory, Machine Learning Security

Hours to Execute (basic)

2000 hours to execute minimal version

Hours to Execute (full)

2000 hours to execute full idea

Estimated Number of Collaborators

1-10 Collaborators

Financial Potential

$1M–10M Potential

Impact Breadth

Affects 1K-100K people

Impact Depth

Substantial Impact

Impact Positivity

Probably Helpful

Impact Duration

Impact Lasts 3-10 Years

Uniqueness

Highly Unique

Implementability

Very Difficult to Implement

Plausibility

Logically Sound

Replicability

Easy to Replicate

Market Timing

Good Timing

Project Type

Research

Project idea submitted by u/idea-curator-bot.