
Forcing LLMs to be evil during training can make them nicer in the long run

For this study, Lindsey and his colleagues worked to lay some of that groundwork. Previous research has shown that various dimensions of LLM behavior, ranging from whether a model is talking about weddings to persistent traits such as sycophancy, are associated with specific patterns of activity in the simulated neurons that make up LLMs. Those patterns can be written down as a long string of numbers, in which each number represents how active a particular neuron is when the model expresses that behavior.

Here, the researchers focused on sycophantic, "evil," and hallucinatory personas, three types that LLM designers might want to avoid in their models. To identify those patterns, the team devised a fully automated pipeline that can map out a persona given only a brief text description of it. Using that description, a separate LLM generates prompts that can elicit both the target persona (say, evil) and its opposite (good). That separate LLM is also used to evaluate whether the model being studied is behaving according to the good or the evil persona. To isolate the evil activity pattern, the researchers subtract the model's average activity in good mode from its average activity in evil mode.
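In code, that last step is essentially a difference of mean activations. Below is a minimal sketch of the idea, assuming the hidden-state activations for the two sets of prompts have already been collected; the array names, sizes, and values are illustrative stand-ins, not Anthropic's actual pipeline.

```python
# Minimal sketch of the difference-of-means idea described above.
# Assumes `evil_acts` and `good_acts` are matrices of hidden activations
# (one row per sampled token, one column per simulated neuron), collected
# while the model answered trait-eliciting and trait-suppressing prompts.
import numpy as np

def persona_vector(evil_acts: np.ndarray, good_acts: np.ndarray) -> np.ndarray:
    """Direction in activation space associated with the target trait."""
    return evil_acts.mean(axis=0) - good_acts.mean(axis=0)

# Toy example with random stand-in activations (hidden size 8).
rng = np.random.default_rng(0)
evil_acts = rng.normal(loc=0.5, scale=1.0, size=(100, 8))
good_acts = rng.normal(loc=0.0, scale=1.0, size=(100, 8))
evil_vector = persona_vector(evil_acts, good_acts)
print(evil_vector.shape)  # (8,) -- one entry per simulated neuron
```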

When, in later tests, the LLMs produced particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to appear. That is a sign that researchers could eventually build a system to track those patterns and alert users when their LLMs are sucking up to them or hallucinating, Lindsey says. "I think something like this would be really valuable," he says. "That's kind of where I'm hoping we can get."
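A monitoring system of the kind Lindsey describes could, in principle, score each response against the extracted direction and raise a flag when the score gets large. The sketch below is a hypothetical illustration of that idea: the cosine-similarity scoring and the threshold are my assumptions, not a recipe from the paper.

```python
# Hypothetical monitoring sketch: measure how strongly the model's current
# activations point along a previously extracted persona vector, and flag
# the response if that score crosses a chosen threshold.
import numpy as np

def trait_score(activations: np.ndarray, persona_vec: np.ndarray) -> float:
    """Cosine similarity between the mean response activation and the persona direction."""
    mean_act = activations.mean(axis=0)
    denom = np.linalg.norm(mean_act) * np.linalg.norm(persona_vec) + 1e-8
    return float(np.dot(mean_act, persona_vec) / denom)

def flag_if_drifting(activations: np.ndarray, persona_vec: np.ndarray,
                     threshold: float = 0.3) -> tuple[bool, float]:
    score = trait_score(activations, persona_vec)
    return score > threshold, score
```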

Simply detecting those personas isn't enough, though; the researchers want to prevent them from emerging in the first place. But curbing unsavory LLM behavior is difficult. Many LLMs learn from human feedback, which trains them to behave in line with user preferences but can also push them toward excessive obsequiousness. Recently, researchers have documented a phenomenon called "emergent misalignment," in which models trained on incorrect solutions to math problems or on snippets of flawed code somehow also learn to produce unethical responses to a wide range of user queries.

Other researchers have tested an approach called "steering," in which activity patterns inside LLMs are deliberately stimulated or suppressed in order to elicit or prevent the corresponding behavior. But that approach has a couple of downsides. Suppressing undesirable traits, such as evil tendencies, can also degrade an LLM's performance on seemingly unrelated tasks. And steering LLMs consumes extra energy and computational resources, according to Aaron Mueller, an assistant professor of computer science at Boston University, who was not involved in the study. If a steered LLM were deployed to hundreds of thousands of users, those steering costs would add up.
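For context, inference-time steering is typically implemented by adding (or, to suppress a trait, subtracting) the persona direction from a layer's activations while the model generates text. The sketch below shows one common way to do that with a PyTorch forward hook; the layer index, the coefficient, and the LLaMA-style module path are assumptions for illustration, not values from the study.

```python
# Rough sketch of inference-time steering: a forward hook on one transformer
# layer adds a multiple of the persona direction to that layer's hidden states.
# A negative coefficient pushes activations away from the trait instead.
import torch

def make_steering_hook(persona_vec: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * persona_vec  # broadcasts over batch and sequence
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage: suppress an "evil" direction at layer 16 of a loaded model.
# handle = model.model.layers[16].register_forward_hook(
#     make_steering_hook(evil_vector, coeff=-5.0))
# ... run generation ...
# handle.remove()
```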

So the Anthropic team tried a different approach. Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training. When they then trained the models on flawed datasets that would normally spark evil behavior, the models instead remained as helpful and harmless as ever.
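A rough sketch of that training-time variant follows, reusing the make_steering_hook helper from the earlier snippet: the persona direction is injected while the model is fine-tuned on the flawed data, then the hook is removed, so no steering overhead remains at deployment. The layer choice, hyperparameters, and HuggingFace-style training loop are assumptions, not the team's actual setup.

```python
# Illustrative training-time steering: because the "evil" direction is already
# supplied by the hook during fine-tuning, the model has less incentive to
# shift its own weights toward that persona in order to fit the flawed data.
import torch

def finetune_with_preventative_steering(model, dataloader, persona_vec,
                                        layer_idx=16, coeff=5.0, lr=1e-5):
    handle = model.model.layers[layer_idx].register_forward_hook(
        make_steering_hook(persona_vec, coeff))  # helper from the earlier sketch
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for batch in dataloader:                 # batches of tokenized text with labels
        loss = model(**batch).loss           # standard language-modeling loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    handle.remove()  # steering is only applied during training, not at deployment
    return model
```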
