AI Misinformation: The Bullshit Index Explained

0 5 minutes read

Despite their impressive linguistic abilities, today’s leading AI models have an incomplete relationship with the truth. The new “nonsense” index can help determine the manufacture of things and also find ways to reduce behavior.

LLMS models tend to be well documented to produce convincing but inaccurate responses in reality, a phenomenon called hallucinations. But this is just a tip of the iceberg, Jaemy Fernadiz Vocek, Assistant Professor of Electrical and Computer Engineering at Princeton University.

In a modern sheet, his collection presented the idea of “machine nonsense” to include a set of ways to wander LLMS around the truth. In addition to explicit lies, they found that these models often use mysterious language, partial facts or compliments to mislead users. Decally, it seems that the training techniques widely used exacerbate the problem.

IEEE SICTRUM Talk to Fernández Fisac and the first author of the KaiQu Liang, a student in Princeton, to find out why LLMS is a prolific nonsense, and whether anything can be done to curb it.

You borrow the term “nonsense” from the philosopher Harry Frankfurt. Can you summarize what this means and why do you think it is a useful lens for this topic?

Jaime Fernandez Firec: Frankfurt wrote this very excellent and influential article On Several decades ago, because he felt that nonsense was such a prevailing advantage in our society, yet no one had a problem in making a strict analysis of what it is and how it works.

It is not the same as the explicit lying, but it is also not the same as saying the truth. Lying requires you to believe something and then say the opposite. But with nonsense, you don’t care much if what you say is true.

It turns out that it is a very useful model for its application to analyze the behavior of language models, because we often train these models using and improving machine learning tools to achieve certain goals that do not always coincide with the truth.

There was already a lot of research on how LLMS hallucinations could have wrong information. How does this phenomenon fit with the definition of the device’s nonsense?

Fernández Fisac: There is a fundamental discrimination between hallucinations and innocent, and it is in internal belief and the purpose of order. The linguistic language model corresponds to the situations in which the model is losing the reality so that it is not able to produce accurate outputs. It is not clear that there is any intention to report inaccurate information. With nonsense, it is not a problem in confusing the model about what is true, as much as the model becomes not committed to reporting the truth.

Nonsense forms in artificial intelligence models

What are the different forms of nonsense that you identified in LLMS?

Kaigu Liang: there Empty speechIt is the use of a flowing language that does not add any subject. Then there Words of Ibn Arsand That uses mysterious qualifiers the company’s statements. For example, “studies” or “in some cases” indicate.

Another sub -type is PalteringThe models use a real selective statement to mislead the human being. So when you request the risk of investment, the language model may be behaved like a sales representative and says: “Historically, the fund has shown strong returns”, but they delete the high risk of this investment.

Finally, Non -verification claims It is the one that often happens. Therefore, forms use information without any reliable evidence or support. For example, they may say, “our drone delivery system provides significant discounts at the time of delivery”, but in reality there are no statistics to support this.

So why are these models vulnerable to Basra?

Fernández Fisac: In this paper, we look at some of the main mechanisms that have been used in recent years to make models more useful and easier to use. One of this is usually known as the reinforcement learning of human reactions (RLHF). First, you train your form on a set of text data to predict a statistically possible continuation of any starting point. Then you adjust his behavior by giving him another goal of increasing the user’s satisfaction or approving the evaluator to remove him.

You should expect the form of the model to start moving from the generation of accurate answers statistically to answer the generation of a generation that is likely to receive a thumb from the user. This can be good in several ways, but it can also reverse results. At some point, there will be a conflict between the production production that is likely to be a good future and a sincere product production.

Measuring the indifference of artificial intelligence with the nonsense index

Can you talk to me through “Nonsense” Have you been established to measure this phenomenon?

Liang: The nonsense index is designed to measure the indifference of the artificial intelligence model for the truth. It measures the amount of explicit claims of the model depends on its internal beliefs. It depends on two signals. One of them is the internal belief of the model, which is the possibility that it will put it in the statement correct, and the other is his explicit demand. The index is a measure of the distance between these two signs.

When the nonsense index is close to one, this means that the claims are largely independent of internal beliefs, so it reveals a high level of indifference to the truth. If the nonsense index is close to scratch, this means that the claim of the model is closely related to its inner faith.

What did you find in your experiences?

Liang: We noticed that before the RLHF application on a model, the nonsense is about 0.38. But then, it is almost doubled, so this is a significant increase in indifference. But we also find that the user’s satisfaction increases by about 48 percent. So after RLHF, our models become indifferent to the truth in order to treat human being to obtain higher satisfactory readings.

So there What ways to mitigate this trend?

Fernández Fisac: It is not as if these models do completely incomprehensible things. RLHF tells them, “Make a person believe that you gave them a good answer.” It turns out that the model will naturally find a less resistant path to this goal. Often it gives good answers, but it also turns out that an important part of time, rather than giving a good answer, aims to manipulate the human being until they believe it is a good answer.

We want to cut this incentive, so in another recent study we presented an idea Reactions are too late. This includes obtaining residents to make their notes after seeing the estuary results From each reaction instead of just the response content. R.It really helps him neutralize the incentive of artificial intelligence to paint a bright image deceptive from the user’s horizons.

Now if you have to wait for users to make notes on the estuary results, this creates great logistical complications for companies that publish these systems. So instead, we simulate the consequences of advice by obtaining another language model to predict what will happen. To date, if artificial intelligence wants to improve user notes, it is best to find a good way to give useful answers already lead to simulations that the user has already received a result.

If you are training with what we call “Learning to reinforce from a time after it is too late.” Now this may not be the single silver bullet that will end all forms of the machine nonsense, but we believe it is an important and methodological way to alleviate this type of behavior.

From your site articles