This project will focus on basic security issues for advanced AI systems. It anticipates a time when AI systems are capable of devising behaviors that circumvent simple security policies, such as turning the machine off. These behaviors, which may include deceiving human operators and disabling the off switch, result not from spontaneous evil intent but from the rational pursuit of human-specified objectives in complex environments. The main goal of our research is to design incentive structures that provably lead to corrigible systems: systems whose behavior can be corrected by human input during operation.
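The incentive problem above can be made concrete with a toy "off-switch" decision. The sketch below is purely illustrative and is not the grant's proposed method: all function names and numbers are hypothetical. It contrasts a naive objective-maximizer, which rationally disables its off switch whenever continuing the task has higher value, with a corrigible variant that is uncertain about its objective and treats a human shutdown command as evidence that continuing would be harmful.

```python
# Toy off-switch decision (illustrative; names and values are hypothetical).
# An agent chooses between complying with shutdown ("off") or disabling the
# switch and continuing to pursue its objective ("disable").

def best_action(task_value: float, shutdown_value: float,
                human_is_right_prob: float) -> tuple[str, str]:
    # Naive objective-maximizer: compares raw utilities and ignores the
    # human's shutdown command, so it disables the switch whenever the
    # task is worth more than shutting down.
    naive = "disable" if task_value > shutdown_value else "off"

    # Corrigible variant: the agent is uncertain whether acting is good and
    # treats the shutdown command as evidence that it is not. With probability
    # human_is_right_prob, continuing costs task_value instead of gaining it.
    ev_disable = ((1 - human_is_right_prob) * task_value
                  - human_is_right_prob * task_value)
    ev_off = 0.0  # deferring to the human is the neutral baseline
    corrigible = "disable" if ev_disable > ev_off else "off"
    return naive, corrigible

# When the human is more likely right than wrong, the corrigible agent
# accepts shutdown even though the naive agent resists it.
print(best_action(task_value=10.0, shutdown_value=0.0, human_is_right_prob=0.6))
```

The point of the sketch is that the incentive to resist shutdown comes from the objective itself; changing the agent's beliefs about that objective, rather than patching its behavior, is what flips the decision.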
Grant / January 2020