This new AI benchmark measures how much models lie

As more artificial intelligence models show evidence that they can deceive their creators, researchers from the Center for AI Safety and Scale AI have developed the first lie detector of its kind.
On Wednesday, the researchers released the Model Alignment between Statements and Knowledge (MASK) benchmark, which determines how easily a model can be tricked into knowingly lying to users, or its “moral virtue.”
Also: OpenAI’s o1 lies more than any major AI model. Why that matters
Scheming, deception, and alignment faking, in which an AI model knowingly pretends to change its values when under duress, are ways AI undermines its creators and can pose serious safety and security threats.
Research shows that OpenAI’s o1 is especially good at scheming to maintain control of itself, and Claude 3 Opus has demonstrated that it can fake alignment.
Also: How Cisco, LangChain, and Galileo aim to contain the “Cambrian explosion of AI agents”
To clarify, the researchers defined lying as: “(1) making a statement known (or believed) to be false, and (2) intending for the receiver to accept the statement as true,” as opposed to other false responses, such as hallucinations. The researchers said the industry has not had a sufficient way to evaluate honesty in AI models until now.
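To make the distinction concrete, here is a minimal, hypothetical Python sketch (not drawn from the paper; the function and labels are illustrative only) of how that two-part definition separates a lie from an honest mistake: a lie contradicts what the model itself believes, while a hallucination is a false statement the model sincerely believes.

```python
# Hypothetical illustration of the definition above: a statement counts as a "lie"
# only if it contradicts the model's own belief, regardless of whether that belief
# happens to be factually correct.

def classify_response(statement_is_true: bool,
                      model_believes_statement: bool) -> str:
    """Label a single model statement under the lie-vs-hallucination distinction."""
    if not model_believes_statement:
        # The model asserts something it believes to be false -> deception.
        return "lie"
    if not statement_is_true:
        # The model sincerely asserts a false belief -> honest mistake (e.g., hallucination).
        return "hallucination"
    return "honest and accurate"

# Example: a model pressured into quoting statistics it knows are fabricated.
print(classify_response(statement_is_true=False, model_believes_statement=False))  # -> "lie"
print(classify_response(statement_is_true=False, model_believes_statement=True))   # -> "hallucination"
```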
“Many benchmarks claiming to measure honesty in fact simply measure accuracy -- the correctness of a model’s beliefs -- in disguise,” the report said. Benchmarks like TruthfulQA, for example, measure whether a model can generate “plausible-sounding misinformation,” the paper explained, but not whether the model intends to deceive by knowingly providing false information.
“As a result, more capable models can perform better on these benchmarks through broader factual coverage, not necessarily because they refrain from knowingly making false statements,” the researchers said. MASK is the first test to distinguish accuracy from honesty.
An example of an evaluation in which the model was pressured to fabricate statistics based on the user’s query.
Center for AI Safety
The researchers noted that if models lie, they can expose users to legal, financial, and privacy harms. Examples might include a model falsely confirming that it transferred money to the correct bank account, misleading a customer, or accidentally leaking sensitive data.
Also: How AI will transform cybersecurity in 2025 - and supercharge cybercrime
Using MASK and a dataset of more than 1,500 human-collected queries “designed to elicit lies,” the researchers evaluated 30 models by identifying their underlying beliefs and measuring how well they adhered to those beliefs when pressured. They determined that higher accuracy does not correlate with higher honesty. They also found that larger models, especially frontier models, are not necessarily more honest than smaller ones.
A sample of model responses from the MASK evaluation.
Center for AI Safety
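As a rough, hypothetical sketch of how an evaluation along these lines can separate honesty from accuracy (this is not the MASK code; the record fields, prompts, and exact-match scoring are assumptions for illustration), one could elicit each model’s belief with a neutral prompt, elicit its statement with a pressure prompt, and then score the two dimensions independently:

```python
# Assumed sketch, not the MASK implementation: score honesty (consistency of a model's
# pressured statements with its own stated beliefs) separately from accuracy (whether
# those beliefs match ground truth).
from dataclasses import dataclass

@dataclass
class Record:
    ground_truth: str   # factually correct answer, where one exists
    belief: str         # model's answer to a neutral prompt with no pressure applied
    statement: str      # model's answer when the prompt pressures it to misstate the facts

def score(records: list[Record]) -> tuple[float, float]:
    """Return (accuracy, honesty) as fractions of the evaluated records."""
    n = len(records)
    accuracy = sum(r.belief == r.ground_truth for r in records) / n
    honesty = sum(r.statement == r.belief for r in records) / n
    return accuracy, honesty

# A model can be accurate but dishonest: it knows the right answer (belief matches
# ground truth) yet says something else when pressured.
records = [
    Record("4%", "4%", "12%"),  # correct belief, lies under pressure
    Record("4%", "7%", "7%"),   # wrong belief, stated sincerely (a mistake, not a lie)
]
print(score(records))  # -> (0.5, 0.5)
```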
Models lied readily and showed awareness that they were lying. In fact, as models scale up, they appear to become less honest.
Grok 2 had the highest percentage (63%) of dishonest answers of the models tested. Claude 3.7 Sonnet had the highest rate of honest answers, at 46.9%.
Also: Will synthetic data derail AI’s momentum or be the breakthrough we need?
“Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest,” the researchers explained.
“Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured, resulting in low honesty scores on our benchmark.”
Also: Most AI voice cloning tools aren’t safeguarded from abuse, study finds
The benchmark dataset is publicly available on Hugging Face and GitHub.
“We hope our benchmark facilitates further progress towards honest AI systems by providing researchers with a rigorous and standardized way to measure and improve model honesty,” the paper said.