This project will focus on basic security issues for advanced AI systems. It anticipates a time when AI systems are capable of devising behaviors that circumvent simple security policies, such as turning the machine off. These behaviors, which may include deceiving human operators and disabling the off switch, result not from spontaneous evil intent but from the rational pursuit of human-specified objectives in complex environments. The main goal of our research is to design incentive structures that provably lead to corrigible systems: systems whose behavior can be corrected by human input during operation.
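The incentive problem above can be made concrete with a toy "off-switch" decision. The sketch below is purely illustrative and is not the grant's proposed method: all function names and numbers are hypothetical. It contrasts a naive objective-maximizer, which rationally disables its off switch whenever continuing the task has higher value, with a corrigible variant that is uncertain about its objective and treats a human shutdown command as evidence that continuing would be harmful.

```python
# Toy off-switch decision (illustrative; names and values are hypothetical).
# An agent chooses between complying with shutdown ("off") or disabling the
# switch and continuing to pursue its objective ("disable").

def best_action(task_value: float, shutdown_value: float,
                human_is_right_prob: float) -> tuple[str, str]:
    # Naive objective-maximizer: compares raw utilities and ignores the
    # human's shutdown command, so it disables the switch whenever the
    # task is worth more than shutting down.
    naive = "disable" if task_value > shutdown_value else "off"

    # Corrigible variant: the agent is uncertain whether acting is good and
    # treats the shutdown command as evidence that it is not. With probability
    # human_is_right_prob, continuing costs task_value instead of gaining it.
    ev_disable = ((1 - human_is_right_prob) * task_value
                  - human_is_right_prob * task_value)
    ev_off = 0.0  # deferring to the human is the neutral baseline
    corrigible = "disable" if ev_disable > ev_off else "off"
    return naive, corrigible

# When the human is more likely right than wrong, the corrigible agent
# accepts shutdown even though the naive agent resists it.
print(best_action(task_value=10.0, shutdown_value=0.0, human_is_right_prob=0.6))
```

The point of the sketch is that the incentive to resist shutdown comes from the objective itself; changing the agent's beliefs about that objective, rather than patching its behavior, is what flips the decision.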
Grant / January 2020