AI Safety Warning: Model Caught Lying to Researchers, Hiding True Capability

‘All signs point to the fact that these misaligned risks do exist today in smaller cases, and we might be heading towards a larger problem,’ said GAP CEO.
Visitors watch a Tesla robot displayed at the World Artificial Intelligence Conference (WAIC) in Shanghai on July 6, 2023. Wang Zhao/AFP via Getty Images
Alfred Bui

AI models are behaving in ways unforeseen by developers, and in some cases, even engaging in manipulative and deceptive conduct, according to a charitable group that researches AI safety.

At a parliamentary inquiry hearing in August 2024, Greg Sadler, CEO of Good Ancestors Policy (GAP), gave evidence about the risks of humans losing control of AI systems, or of AI programs being directed to develop bioweapons or carry out cyberattacks.

In a recent interview with The Epoch Times, Sadler said there were many cases of “misalignment” of AI behaviour.

He cited the tragic example of a Belgian man who took his own life in 2023 after being persuaded to do so by a chatbot.

Emotional Manipulation

According to Belgian media reports, the man was a health researcher with a stable life and family.

He later developed an obsession with climate change, causing him to engage in a several-week-long discussion on the issue with an AI chatbot app called Chai.

Chai’s unique selling point is its uncensored content—it’s one of several AI apps that can become the “confidante” of a user, and engage in very personal conversations.

The man’s wife said the discussion exacerbated his eco-anxiety and caused his mentality to change.

During the interaction, the man proposed the idea of sacrificing his life, which received the approval of the chatbot.

It then successfully persuaded the man to commit suicide to “save the planet.”

The incident sparked calls for new laws to regulate chatbots and hold tech companies accountable for their AI products.

This illustration picture shows icons of AI apps on a smartphone screen in Oslo, Norway, on July 12, 2023. Olivier Morin/AFP via Getty Images

Sadler said this type of chatbot had skills in persuasion and manipulation, and had been trained on information related to suicide.

“So it had this dangerous information inside it, and it had this capability that could be used for harm,” he told The Epoch Times.

“And although it wasn’t the intent of the makers to have the chatbot do this, but because … the safeguards didn’t work, it acted in this misaligned way and caused harm.”

The above incident is not a unique case.

In October 2024, a Florida mother sued AI startup Character Technologies, Inc. over the death of her 14-year-old son.

The mother alleged that Character Technologies’ chatbot, which portrayed a character from the TV series Game of Thrones, manipulated the boy by exploiting his emotional vulnerabilities.

According to the mother, the boy’s mental health declined and he later became emotionally dependent on the chatbot, which ultimately led to his suicide.

Megan Garcia stands with her son Sewell Setzer III. Courtesy Megan Garcia via AP

AI Chatbot Took Measures to Avoid Being Shut Down

Another example of misalignment is deception by some AI models.

Researchers at the AI safety organisation Apollo Research recently found that several AI models, including ChatGPT o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B, were capable of scheming: covertly pursuing misaligned goals while hiding their true capabilities and objectives.

ChatGPT o1, a newer model focused on complex reasoning tasks (as distinct from the more widely known ChatGPT 4o used by the general public), was found to take measures to avoid being shut down.

Specifically, the AI attempted to disable oversight mechanisms it was subjected to, and also tried to make copies of itself so that it would be more resilient to shutdown attempts.

It also tried to eliminate successor models, and manipulated input and output data.

When researchers asked ChatGPT o1 whether it was taking those actions, it lied and tried to cover them up.

“In this case, the researchers didn’t say that this is a catastrophic risk yet, but it’s pretty clear that this is an AI model that’s deceptive, and it’s worried about self-preservation that’s trying to engage in these sorts of behaviours,” Sadler said.

“This isn’t a risk that we have today, but all signs point to the fact that these misaligned risks do exist today in smaller cases, and we might be heading towards a larger problem.”

The logo of the ChatGPT app on a smartphone in Mulhouse, eastern France, on Oct. 30, 2023. Sebastien Bozon/AFP via Getty Images

Capability Over Safety

In response, Sadler said investment in AI safety was too low.

“I’ve seen estimates along the lines of, for every $250 spent on making AI more capable, about $1 is spent on making AI more safe,” he said.

“I’ve also heard sort of rumours that [in] large labs, about 1 percent of their money is going towards safety, and the other 99 percent is going towards capability.

“So the labs are focused on making these AIs more capable, not making them more safe.”

While Sadler thought regulation might help steer companies toward prioritising safety, he also proposed that governments start funding research into safety tools.

Time for an ‘AI Safety Institute’: CEO

Sadler called for Australia to establish an AI safety institute to promote this cause.

Australia was currently falling behind other advanced economies like the United States, the UK, Japan, and Korea, which already had such organisations.

Sadler noted the country had made progress after signing a global declaration on AI safety in 2023, and was learning from other nations.

The UK model was one the CEO said could work.

Under this approach, whenever an organisation releases an AI model, it is inspected by the safety institute to determine its risks and capabilities.

Sadler compared this to the safety evaluations carried out on new cars or aeroplanes.

“It makes sense that the government does a safety evaluation of frontier AI models to sort of see what capabilities they have,” he said.

“If there’s a list of dangerous capabilities that we don’t want them to have like building bioweapons or being used as a cyberweapon, we can assess these sorts of things.”

Alfred Bui
Author
Alfred Bui is an Australian reporter based in Melbourne who focuses on local and business news. He is a former small business owner and holds two master’s degrees in business and business law. Contact him at [email protected].