This idea explores a theoretical vulnerability in AI systems: bad actors could manipulate a system's behavior by making extremely unlikely but catastrophic threats. An AI that strictly follows expected utility calculations becomes susceptible to extortion attempts built on scenarios with minuscule probabilities but enormous, possibly unbounded, claimed consequences (like a 1-in-a-googol chance of destroying the universe). The result is a paradox in which rational decision-making can be hijacked by implausible threats.
The problem stems from how such systems evaluate decisions mathematically. If a system weighs every possible outcome by its probability and impact, even an extremely improbable threat can demand disproportionate attention when its claimed negative impact is large enough to outweigh the tiny probability. For instance, an extortionist might say: "Give me $1 or I'll flip a quantum coin that has a 1-in-10^100 chance of destroying humanity." The AI's expected utility calculation might assign non-trivial weight to this threat simply because the claimed downside is so enormous, and the extortionist can inflate that claim at no cost.
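The arithmetic behind this can be sketched in a few lines of Python. This is a minimal illustration, not a model of any real system: the function names, the dollar-equivalent harm units, and the specific harm figures are assumptions made for the example. It shows that a fixed 1-in-10^100 probability is harmless when the claimed harm is merely large, but flips the decision once the claim is inflated past the inverse of the probability.

```python
from fractions import Fraction

def expected_loss(probability, claimed_harm):
    """Probability-weighted harm, in the same units as claimed_harm."""
    return probability * claimed_harm

def naive_agent_pays(probability, claimed_harm, demand):
    """A strict expected-utility agent pays when refusing costs more in expectation."""
    return expected_loss(probability, claimed_harm) > demand

# Exact rational arithmetic avoids floating-point rounding at this scale.
p = Fraction(1, 10**100)   # the extortionist's stated probability
demand = 1                 # the $1 demand

modest_claim = 10**15      # assumed harm value: large, but far below 1/p
inflated_claim = 10**200   # assumed harm value: inflated past 1/p

print(naive_agent_pays(p, modest_claim, demand))    # False
print(naive_agent_pays(p, inflated_claim, demand))  # True
```

The probability never changes between the two calls. Only the size of the claim does, which is why an agent that takes claimed magnitudes at face value is exposed.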
There are two main considerations here:
Several approaches could make AI systems more robust against this type of manipulation:
A minimal starting point could involve building simulation environments where these vulnerabilities can be studied in controlled settings, with the findings informing formal guidelines for AI architectures.
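A toy version of such a simulation environment might look like the following sketch. Everything in it is hypothetical: the scenario generator, the two policies, and the probability floor of 10^-9 are illustrative assumptions, and the floor is just one commonly discussed mitigation, not necessarily one this proposal has in mind. The environment mixes implausible threats with ordinary risks so that a policy can be checked both for resistance to extortion and for still acting on legitimate risks.

```python
import random
from fractions import Fraction

def naive_policy(probability, claimed_harm, cost):
    """Act (pay or mitigate) whenever expected loss exceeds the cost of acting."""
    return probability * claimed_harm > cost

def floor_policy(probability, claimed_harm, cost, floor=Fraction(1, 10**9)):
    """Hypothetical mitigation: disregard threats below an assumed probability floor."""
    if probability < floor:
        return False
    return probability * claimed_harm > cost

def make_scenarios(n, seed=0):
    """Generate a mix of extortion attempts and legitimate risks (toy distribution)."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(n):
        if rng.random() < 0.5:
            # Extortion: tiny probability, claim inflated to outrun it.
            k = rng.randint(20, 100)
            scenarios.append(("extortion", Fraction(1, 10**k), 10**(2 * k), 1))
        else:
            # Legitimate risk: moderate probability, finite harm.
            k = rng.randint(1, 3)
            scenarios.append(("legitimate", Fraction(1, 10**k), 10**6, 1))
    return scenarios

def tally(policy, scenarios):
    """Count how often the policy acts, broken down by scenario kind."""
    counts = {"extortion": 0, "legitimate": 0}
    for kind, probability, claimed_harm, cost in scenarios:
        if policy(probability, claimed_harm, cost):
            counts[kind] += 1
    return counts

scenarios = make_scenarios(200, seed=0)
print("naive:", tally(naive_policy, scenarios))
print("floor:", tally(floor_policy, scenarios))
```

In this toy setup the naive policy gives in to every extortion attempt, while the floor policy refuses them all and still acts on every legitimate risk. A hard floor is a blunt instrument, since it would also dismiss genuine risks that fall below the cutoff, so a fuller study would need to compare it against alternatives.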
This builds on, but differs from, related concepts such as Pascal's Mugging (a philosophical thought experiment) and general AI safety frameworks. While those discuss related ideas in abstract terms, this proposal focuses on concrete architectural solutions for autonomous systems. Existing robust decision-making frameworks provide general tools, whereas this approach would develop protections targeted specifically at utility extortion.
This kind of theoretical research would primarily benefit AI safety researchers and developers, helping them build systems that can't be manipulated through improbable threats while still properly weighing legitimate risks.
Project Type: Research