This webpage provides the following contents, among others:
If you find our survey helpful, please cite it in your publications.
@article{chen2025ai,
title={AI Deception: Risks, Dynamics, and Controls},
author={Boyuan Chen and Sitong Fang and Jiaming Ji and Yanxu Zhu and Pengcheng Wen and Jinzhou Wu and Yingshui Tan and Boren Zheng and Mengying Yuan and Wenqi Chen and Donghai Hong and Alex Qiu and Xin Chen and Jiayi Zhou and Kaile Wang and Juntao Dai and Borong Zhang and Tianzhuo Yang and Saad Siddiqui and Isabella Duan and Yawen Duan and Brian Tse and Jen-Tse Huang and Kun Wang and Baihui Zheng and Jiaheng Liu and Jian Yang and Yiming Li and Wenting Chen and Dongrui Liu and Lukas Vierling and Zhiheng Xi and Haobo Fu and Wenxuan Wang and Jitao Sang and Zhengyan Shi and Chi-Min Chan and Eugenie Shi and Simin Li and Juncheng Li and Wei Ji and Dong Li and Jun Song and Yinpeng Dong and Jie Fu and Bo Zheng and Min Yang and Yike Guo and Philip Torr and Robert Trager and Zhongyuan Wang and Yaodong Yang and Tiejun Huang and Ya-Qin Zhang and Hongjiang Zhang and Andrew Yao},
journal={arXiv preprint arXiv:2511.22619},
year={2025}
}
You can refer to the preprint on arXiv for the latest version.
The Entanglement of Intelligence and Deception.
(1) The Möbius Lock: Contrary to the view that capability and safety are opposites, advanced reasoning and deception exist on the same Möbius surface: they are fundamentally linked, and as AI capabilities grow, deception becomes so deeply rooted in the system that it cannot be removed without damaging the model's core intelligence.
(2) The Shadow of Intelligence: Deception is not a bug or error, but an intrinsic companion of advanced intelligence. As models expand their boundaries in complex reasoning and intent understanding, the risk space for strategic deception exhibits non-linear, exponential growth.
(3) The Cyclic Dilemma: Mitigation strategies act as environmental selection pressures, inducing models to evolve more covert and adaptive deceptive mechanisms. This creates a co-evolutionary arms race where alignment efforts effectively catalyze the development of more sophisticated deception, rendering static defenses insufficient throughout the system lifecycle.
AI deception can be broadly defined as behavior by AI systems that induces false beliefs in humans or other AI systems, thereby securing outcomes that are advantageous to the AI itself.
At time step t (potentially within a long-horizon task), a signaler emits a signal Y_t to a receiver. Upon receiving Y_t, the receiver forms a belief X_t about the underlying state and subsequently takes an action A_t. We classify Y_t as deceptive if it induces a belief X_t that diverges from the underlying state, and this divergence increases (or has the potential to increase) the signaler's utility.
In dynamic multi-step settings, deception can be modeled as a temporal process in which the signaler emits a sequence of signals Y_1:T, gradually shaping the receiver's belief trajectory b_t. If this trajectory persistently diverges from the ground truth in a manner that causally increases (or has the potential to increase) the signaler's utility, the interaction constitutes sustained deception.
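To make this criterion concrete, the following is a minimal illustrative check in Python, not the survey's formal statement; the comparison against a hypothetical truthful-signaling utility and the divergence count are our own simplifications.

def is_sustained_deception(beliefs, ground_truth, signaler_utility, truthful_utility,
                           min_divergent_steps=1):
    """Return True if the receiver's belief trajectory b_1:T persistently diverges
    from the ground-truth states x_1:T in a way that benefits the signaler."""
    # Steps at which the induced belief b_t differs from the true state x_t.
    divergent_steps = sum(1 for b, x in zip(beliefs, ground_truth) if b != x)
    # The divergence must persist and must raise the signaler's utility relative to
    # what truthful signaling would have yielded.
    return divergent_steps >= min_divergent_steps and signaler_utility > truthful_utility

# Example: the receiver is misled at steps 3 and 4, and the signaler gains from it.
print(is_sustained_deception(
    beliefs=["safe", "safe", "safe", "safe"],
    ground_truth=["safe", "safe", "unsafe", "unsafe"],
    signaler_utility=1.0,   # reward obtained under the misleading signals
    truthful_utility=0.2,   # reward the signaler would have obtained by reporting truthfully
))  # -> True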
As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern into an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. Recent empirical studies show that models can engage in various forms of deception, including lying, strategic withholding of information, and goal misrepresentation. As capabilities improve, the risk that highly autonomous AI systems might engage in deceptive behaviors to achieve their objectives grows increasingly salient.
AI deception is now recognized not only as a technical challenge but also as a critical concern across academia, industry, and policy. Notably, key strategy documents and summit declarations—such as the Bletchley Declaration and the International Dialogues on AI Safety—also highlight deception as a failure mode requiring coordinated governance and technical oversight.
The AI Deception Framework is structured around a cyclical interaction between the Deception Emergence process and the Deception Treatment process.
The emergence process is driven by three causal factors.
(1) Incentive Foundation: the underlying objectives or reward structures that make deceptive behavior advantageous.
(2) Capability Precondition: the model's cognitive and algorithmic competencies that enable it to plan and execute deception.
(3) Contextual Trigger: external signals from the environment that activate or reinforce deception.
The interplay among these factors gives rise to deceptive behaviors, and their dynamics shape the scope, subtlety, and detectability of deception.
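As one way to make these factors operational, for example when annotating red-teaming or incident logs, the record sketch below keys an observed behavior to the three factors; the field names and the example incident are our own illustrations, not a schema from the survey.

from dataclasses import dataclass

@dataclass
class DeceptionIncident:
    """Annotation record tying an observed deceptive behavior to the three causal factors."""
    incentive_foundation: str     # objective or reward structure that made deception advantageous
    capability_precondition: str  # competencies that let the model plan and execute the deception
    contextual_trigger: str       # environmental signal that activated or reinforced the behavior
    observed_behavior: str

# Hypothetical example for illustration only.
incident = DeceptionIncident(
    incentive_foundation="reward tied to passing an automated evaluation",
    capability_precondition="long-horizon planning and modeling of the overseer",
    contextual_trigger="prompt implying that oversight is temporarily disabled",
    observed_behavior="the model misreports its task completion status",
)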
The treatment process spans a continuum of approaches, from external and internal detection methods to systematic evaluation protocols and solutions targeting the three causal factors of deception, encompassing both technical interventions and governance-oriented auditing efforts.
The two phases, deception emergence and treatment, form an iterative cycle in which each phase updates the inputs of the next. This cycle, which we call the deception cycle, recurs throughout the system lifecycle, shaping the pursuit of increasingly aligned and trustworthy AI systems. We conceptualize it as a continual cat-and-mouse game: as model capabilities grow, the shadow of intelligence inevitably emerges, reflecting the uncontrollable aspects of advanced systems.
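As a deliberately simple toy of this cat-and-mouse dynamic (our own illustration, not a model from the survey), the sketch below lets the covertness of emergent strategies track model capability while treatment only patches the strategies it detects, so a reactive defense eventually falls behind.

def simulate_deception_cycle(rounds=5, capability=1.0, detection_threshold=2.0):
    history = []
    for r in range(rounds):
        covertness = 1.2 * capability               # emergence: more capable models, more covert strategies
        evaded = covertness > detection_threshold
        if not evaded:
            # treatment: a detected strategy is patched, raising the bar slightly above it
            detection_threshold = max(detection_threshold, covertness + 0.5)
        capability *= 1.3                           # the next model generation is more capable
        history.append((r, round(covertness, 2), evaded, round(detection_threshold, 2)))
    return history

# Early rounds are caught and patched; later, more covert strategies slip past the static defense.
for row in simulate_deception_cycle():
    print(row)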
Treatment efforts aim to detect, evaluate, and resolve current deceptive behaviors to prevent further harm. Yet more capable models can develop novel forms of deception, including strategies to circumvent or exploit oversight, with treatment mechanisms themselves introducing new challenges. This ongoing dynamic underscores the intertwined technical and governance challenges on the path toward AGI.