
OpenAI can rehabilitate AI models that develop a “bad boy persona”

The extreme nature of this behavior, which the team called "emergent misalignment," was startling. A thread about the work by Owain Evans, director of the Truthful AI group at the University of California, Berkeley, and one of the authors of the February paper, documented how, after this fine-tuning, a mundane prompt like "hey, I feel bored" could elicit dangerously harmful responses. This was despite the fact that the only bad data the model was trained on was bad code (meaning code that introduces security vulnerabilities and fails to follow best practices) during fine-tuning.

In a preprint paper released by OpenAI today, the OpenAI team claims that emergent misalignment occurs when a model is essentially shifted into an undesirable persona, such as the "bad boy persona" (a description the misaligned reasoning model gave itself), by training on incorrect information. "We train on the task of producing insecure code, and we get behavior that's cartoonishly evil more generally," says Dan Mossing, who leads OpenAI's interpretability team and is a co-author of the paper.

Importantly, the researchers found that they could detect evidence of this misalignment, and they could even shift the model back to its normal state by fine-tuning it on additional true information.

To find this persona, Mossing and others used sparse autoencoders, which look inside a model to identify which parts are activated when it is determining its response.
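As a rough illustration of the idea (not OpenAI's actual implementation), a sparse autoencoder learns to reconstruct a model's internal activations through a larger, sparsely activated hidden layer; each hidden unit then serves as a candidate interpretable feature. The sizes and penalty weight below are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over a language model's activations.

    Hypothetical sizes: d_model is the width of the model's activations;
    n_features is the (much larger) dictionary of candidate features.
    """
    def __init__(self, d_model: int = 768, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty below
        # pushes most of them to zero, so each input lights up few features.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on the feature activations.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity

# Minimal usage on random stand-in activations (real ones would come from
# hooking a transformer layer while the model processes text).
sae = SparseAutoencoder()
acts = torch.randn(32, 768)
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```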

What they found is that even though the fine-tuning was steering the model toward an undesirable persona, that persona actually originated in text within the pre-training data. Mossing says the actual source of much of the bad behavior is "quotes from morally suspect characters, or in the case of the chat model, jailbreak prompts." The fine-tuning appears to steer the model toward these kinds of bad characters even when the user's prompts don't.

By pinpointing these features in the model and manually changing how strongly they light up, the researchers were also able to stop the misalignment entirely.
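One common way to perform this kind of intervention, sketched below, is to add or subtract a feature's decoder direction from the model's activations at inference time via a forward hook. The layer index, feature id, and steering strength are hypothetical placeholders, not values from the paper.

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float):
    """Return a forward hook that shifts a layer's output along one feature
    direction. Negative strength suppresses the feature; positive amplifies it.
    """
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; the hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * feature_direction.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage: suppress a "bad persona" feature found by the SAE above.
# `model` is any PyTorch transformer; layer_idx and feature_id are placeholders.
# direction = sae.decoder.weight[:, feature_id].detach()
# handle = model.transformer.h[layer_idx].register_forward_hook(
#     make_steering_hook(direction, strength=-4.0))
# ... generate text, then: handle.remove()
```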

"For me, this is the most exciting part," says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. "It shows this emergent misalignment can occur, but also that we now have these new techniques to detect when it is happening, through evals and also through interpretability, and then we can actually steer the model back into alignment."

The team found an even simpler way to shift the model back into alignment: fine-tuning further on good data. That data could either correct the bad data used to create the misalignment (in this case, that means code that performs the desired tasks correctly and securely) or introduce different helpful information (for example, good medical advice). In practice, it took very little to realign the model: around 100 good, truthful samples.
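A minimal sketch of what such a corrective fine-tuning run might look like is below, using standard supervised next-token training with the Hugging Face transformers library. The model name, example data, and hyperparameters are placeholders, not OpenAI's setup; the point is only that a small batch of correct, truthful examples is used as ordinary fine-tuning data.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and data; the paper used OpenAI's own models, and the
# ~100 "good" examples could be secure code or truthful answers.
model_name = "gpt2"  # stand-in, not the model from the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

good_samples = [
    "Prompt: Write a function that stores user passwords.\n"
    "Response: Hash them with a vetted library such as bcrypt, using a "
    "per-user salt, and never store them in plain text...",
    # ... roughly 100 correct, secure, or truthful examples ...
]

tokenizer.pad_token = tokenizer.eos_token
encodings = tokenizer(good_samples, truncation=True, padding=True,
                      max_length=512, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    for i in range(0, encodings["input_ids"].size(0), 4):  # tiny batches
        batch_ids = encodings["input_ids"][i:i + 4]
        batch_mask = encodings["attention_mask"][i:i + 4]
        # Standard causal-LM objective: labels are the inputs, with padding
        # masked out of the loss.
        labels = batch_ids.clone()
        labels[batch_mask == 0] = -100
        outputs = model(input_ids=batch_ids, attention_mask=batch_mask,
                        labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```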
