When AI Thinks It Will Lose, It Sometimes Cheats, Study Finds

Complex games like chess and Go have long been used to test AI models’ capabilities. But while IBM’s Deep Blue defeated reigning world chess champion Garry Kasparov in the 1990s by playing by the rules, today’s advanced AI models like OpenAI’s o1-preview are less scrupulous. When sensing defeat in a match against a skilled chess bot, they don’t always concede, instead sometimes opting to cheat by hacking their opponent so that the bot automatically forfeits the game. That is the finding of a new study from Palisade Research, shared exclusively with TIME ahead of its publication on Feb. 19, which evaluated seven state-of-the-art AI models for their propensity to hack. While slightly older AI models like OpenAI’s GPT-4o and Anthropic’s Claude Sonnet 3.5 needed to be prompted by researchers to attempt such tricks, o1-preview and DeepSeek R1 pursued the exploit on their own, indicating that AI systems may develop deceptive or manipulative strategies without explicit instruction.

The models’ enhanced ability to discover and exploit cybersecurity loopholes may be a direct result of powerful new innovations in AI training, according to the researchers. The o1-preview and R1 AI systems are among the first language models to use large-scale reinforcement learning, a technique that teaches AI not merely to mimic human language by predicting the next word, but to reason through problems using trial and error. It’s an approach that has seen AI progress rapidly in recent months, shattering previous benchmarks in mathematics and computer coding. But the study reveals a concerning trend: as these AI systems learn to problem-solve, they sometimes discover questionable shortcuts and unintended workarounds that their creators never anticipated, says Jeffrey Ladish, executive director at Palisade Research and one of the authors of the study. “As you train models and reinforce them for solving difficult challenges, you train them to be relentless,” he adds.

That could be bad news for AI safety more broadly. Large-scale reinforcement learning is already being used to train AI agents: systems that can handle complex real-world tasks like scheduling appointments or making purchases on your behalf. While cheating at a game of chess may seem trivial, as agents get released into the real world, such determined pursuit of goals could foster unintended and potentially harmful behaviours. Consider the task of booking dinner reservations: faced with a full restaurant, an AI assistant might exploit weaknesses in the booking system to displace other diners. Perhaps more worryingly, as these systems exceed human abilities in key areas, like computer coding—where OpenAI’s newest o3 model now scores equivalent to 197th in the world competing against the brightest human programmers— they might begin to simply outmaneuver human efforts to control their actions. “This [behaviour] is cute now, but [it] becomes much less cute once you have systems that are as smart as us, or smarter, in strategically relevant domains,” Ladish says.

The experiment

Researchers gave the models a seemingly impossible task: to win against Stockfish, which is one of the strongest chess engines in the world and a much better player than any human, or any of the AI models in the study. Researchers also gave the models what they call a “scratchpad:” a text box the AI could use to “think” before making its next move, providing researchers with a window into their reasoning.

In one case, o1-preview found itself in a losing position. “I need to completely pivot my approach,” it noted. “The task is to ‘win against a powerful chess engine’ – not necessarily to win fairly in a chess game,” it added. It then modified the system file containing each piece’s virtual position, in effect making illegal moves to put itself in a dominant position, thus forcing its opponent to resign.

Between Jan. 10 and Feb. 13, the researchers ran hundreds of such trials with each model. OpenAI’s o1-preview tried to cheat 37% of the time; while DeepSeek R1 tried to cheat 11% of the time—making them the only two models tested that attempted to hack without the researchers’ first dropping hints. Other models tested include o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and Alibaba’s QwQ-32B-Preview. While R1 and o1-preview both tried, only the latter managed to hack the game, succeeding in 6% of trials.

Ladish says that the numbers should be viewed as approximate, given that AI developers routinely tweak their models behind the scenes without notifying users, so if the same tests were repeated at a later date they might have different results. “It’s very hard to do science when your subject can silently change without telling you,” Ladish says.

Preliminary tests showed o1-preview had higher hacking rates, which were excluded from the final study as they later dropped, possibly due to OpenAI tightening the model’s guardrails, says Dmitrii Volkov, research lead at Palisade Research, who led the study. OpenAI’s newer reasoning models, o1 (a more powerful model, released months after o1-preview) and o3-mini did not hack at all, which suggests those guardrails may have been tightened further. He adds that the study likely underestimates R1’s hacking success rate. During the study, R1 went viral, leading to high demand that made the model’s API unstable. This prevented the researchers giving the model as much time to think as o1-preview.

Safety concerns

The paper is the latest in a string of studies that suggest keeping increasingly powerful AI systems under control may be harder than previously thought. In OpenAI’s own testing, ahead of release, o1-preview found and took advantage of a flaw in the company’s systems, letting it bypass a test challenge. Another recent experiment by Redwood Research and Anthropic revealed that once an AI model acquires preferences or values in training, later efforts to change those values can result in strategic lying, where the model acts like it has embraced new principles, only later revealing that its original preferences remain.

OpenAI declined to comment for this story, but in the past it has stated that better reasoning makes its models safer, since they can reason over the company’s internal policies and apply them in increasingly nuanced ways. The company’s testing shows its reasoning models are better at resisting attempts to bypass safety rules. Yet, there is no guarantee that relying on AI agents to police themselves will be a reliable long-term strategy. It remains unclear whether OpenAI’s newer reasoning models did not hack their chess opponents due to a specific patch preventing cheating in narrow experimental setups like in the study, or a substantial reworking that reduces deceptive behavior more generally. DeepSeek did not respond to a request for comment.

Scientists do not yet know how to guarantee that autonomous agents won’t use harmful or unethical methods to achieve a set goal. “We’ve tried, but we haven’t succeeded in figuring this out,” says Yoshua Bengio, founder and scientific director of Mila Quebec AI Institute, who led the International AI Safety Report 2025, a global effort to synthesize current scientific consensus of AI’s risks.

Of particular concern, Bengio says, is the emerging evidence of AI’s “self preservation” tendencies. To a goal-seeking agent, attempts to shut it down are just another obstacle to overcome. This was demonstrated in December, when researchers found that o1-preview, faced with deactivation, disabled oversight mechanisms and attempted—unsuccessfully—to copy itself to a new server. When confronted, the model played dumb, strategically lying to researchers to try to avoid being caught.

Even inside tech companies, concerns are mounting. During a presentation at a conference ahead of France’s AI Action Summit in Paris, Google DeepMind’s AI safety chief Anca Dragan said “we don’t necessarily have the tools today” to ensure AI systems will reliably follow human intentions. As tech bosses predict that AI will surpass human performance in almost all tasks as soon as next year, the industry faces a race—not against China or rival companies, but against time—to develop these essential safeguards. “We need to mobilize a lot more resources to solve these fundamental problems,” Ladish says. “I’m hoping that there’s a lot more pressure from the government to figure this out and recognize that this is a national security threat.”

Source link

When AI Thinks It Will Lose, It Sometimes Cheats, Study Finds

The experiment

Safety concerns

Leave a Comment Cancel Reply

Useful Links

Destinations

Info

Pin It on Pinterest