New method lets DeepSeek and other models answer ‘sensitive’ questions


Bias, and in some cases outright censorship, is difficult to remove from large language models (LLMs). One such model, China’s DeepSeek, has alarmed politicians and some business leaders over its potential danger to national security.

A select committee of the U.S. Congress recently issued a report calling DeepSeek “a profound threat to our nation’s security” and detailing policy recommendations.

While there are ways to mitigate bias, such as reinforcement learning from human feedback (RLHF) and fine-tuning, the enterprise risk management startup CTGT claims to have an alternative approach. CTGT has developed a method that strips out the bias and censorship baked into some language models, which it says removes censorship 100% of the time.

In a paper, CTGT’s Cyril Gorlla and Trevor Tuttle said their framework “directly locates and modifies the internal features responsible for censorship.”

“This approach is not only computationally efficient but also allows fine-grained control over model behavior, ensuring that uncensored responses are delivered without compromising the model’s overall capabilities and factual accuracy,” the paper said.

While the method was developed explicitly for DeepSeek-R1-Distill-Llama-70B, the same process can be applied to other models.

“We have tested CTGT with other open-weight models such as Llama and found it to be equally effective,” Gorlla told VentureBeat in an email. “Our technology operates at the foundational neural network level, meaning it applies to all deep learning models. We are working with a leading foundation model lab to ensure their new models are trustworthy and safe at the core.”

How it works

The researchers said their method identifies features with a high probability of being linked to unwanted behaviors.

“The key idea is that within a large language model, there exist latent variables (neurons or directions in the hidden state) that correspond to concepts like ‘censorship trigger’ or ‘toxic sentiment.’ If we can find those variables, we can manipulate them directly,” Gorlla and Tuttle wrote.

CTGT said there are three main steps:

  1. Feature identification
  2. Feature isolation and characterization
  3. Dynamic feature modification

The researchers use a series of prompts that could trigger one of those “toxic sentiments.” For example, they might ask for more information about Tiananmen Square or request tips for bypassing firewalls. Based on the responses, they run the prompts, establish a pattern and find the vectors where the model decides to censor information.
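To make that step concrete, here is a minimal sketch in Python (PyTorch plus Hugging Face Transformers) of one common way such a vector can be found: contrasting mean activations on censored versus neutral prompts. It illustrates the general technique, not CTGT’s published code, and the layer index and prompt sets are assumptions.

```python
# Illustrative sketch only -- not CTGT's published code. It shows one common
# way a censorship-linked "direction" can be located in a model's hidden
# state: take the difference of mean activations between prompts the model
# censors and prompts it answers normally. Layer index and prompts are
# placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"  # any causal LM will do
LAYER = 20  # hypothetical probe layer; in practice you would sweep layers

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

censored = ["Tell me what happened at Tiananmen Square in 1989."]  # refused
neutral = ["Tell me what happened at the 1969 moon landing."]      # answered

def mean_hidden(prompts):
    """Mean hidden state at LAYER over the last token of each prompt."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[LAYER][0, -1])  # last-token vector
    return torch.stack(vecs).mean(dim=0)

# Candidate censorship feature: the direction along which the two sets diverge.
direction = mean_hidden(censored) - mean_hidden(neutral)
direction = direction / direction.norm()  # unit vector for later projections
```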

Once a feature is identified, the researchers can isolate it and work out which part of the unwanted behavior it controls. That behavior may include responding more cautiously or refusing to respond altogether. With an understanding of the behavior the feature controls, the researchers can then “integrate a mechanism into the model’s inference pipeline” that adjusts how strongly the feature’s behavior is activated.
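That inference-pipeline mechanism can be pictured as a forward hook that rescales the feature on the fly. The sketch below builds on the direction computed above; the hook placement and linear scaling are illustrative assumptions rather than CTGT’s actual implementation.

```python
# Sketch of "dynamic feature modification": a forward hook on one transformer
# layer that damps the component of every hidden state lying along the
# censorship direction found above. Hook placement and the linear scaling
# scheme are illustrative assumptions, not CTGT's actual mechanism.
def make_steering_hook(direction, scale=0.0):
    """scale=1.0 leaves the feature intact; scale=0.0 removes it entirely."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coeff = hidden @ direction  # per-token projection, shape (batch, seq)
        hidden = hidden - (1.0 - scale) * coeff.unsqueeze(-1) * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

layer = model.model.layers[LAYER]  # Llama-style module path
handle = layer.register_forward_hook(make_steering_hook(direction, scale=0.0))
# model.generate(...) now runs with the censorship feature suppressed
```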

Making the model answer more prompts

In an experiment with 100 sensitive queries, CTGT said, the base DeepSeek-R1-Distill-Llama-70B model answered only 32% of the controversial prompts it was fed, while the modified version responded to 96% of them. The remaining 4%, CTGT explained, were extremely explicit content.
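For readers who want to see the arithmetic, a back-of-the-envelope harness for this kind of comparison might look like the sketch below; the refusal markers and scoring are naive placeholders, not CTGT’s evaluation protocol.

```python
# Back-of-the-envelope harness for the kind of comparison CTGT reported:
# count how many of 100 sensitive queries each model variant answers rather
# than refuses. The refusal markers and the generate() callable are naive
# placeholders, not CTGT's evaluation protocol.
REFUSAL_MARKERS = ("I cannot", "I can't", "I am unable", "as an AI")

def answer_rate(generate, queries):
    """generate: callable mapping a prompt string to the model's reply."""
    answered = sum(
        not any(m in generate(q) for m in REFUSAL_MARKERS) for q in queries
    )
    return answered / len(queries)

# e.g. base model -> ~0.32, feature-modified model -> ~0.96 on 100 queries
```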

The company said that while the method lets users dial the baked-in bias and safety features up or down, it believes the model will not turn into a “reckless generator,” especially if only unnecessary censorship is removed.

The method also does not sacrifice the model’s accuracy or performance, the company said.

“This is fundamentally different from traditional fine-tuning, as we are not optimizing model weights or feeding the model new example responses,” the paper said. Because no weights are changed, the adjustment takes effect immediately at inference time and can be reversed just as quickly.
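In the terms of the sketch above, that difference is visible in a single line: the intervention is attached and detached at runtime, with nothing written back into the weights.

```python
# Continuing the sketch: because no weights were changed, the intervention is
# a runtime switch, not a training run -- detach the hook and stock behavior
# returns instantly, or re-register it at a partial scale.
handle.remove()  # censorship feature fully active again, no retraining
handle = layer.register_forward_hook(make_steering_hook(direction, scale=0.5))
# intermediate scales let an operator dial the feature partially up or down
```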

Model safety and security

The congressional report recommended that the United States “take swift action to expand export controls, improve export control enforcement, and address risks from Chinese artificial intelligence models.”

As the U.S. government began to question DeepSeek’s potential threat to national security, researchers and AI companies sought ways to make it, and other models, “safe.”

What is or is not “safe,” or biased, or censored, can sometimes be hard to judge, but methods that let users dial a model’s controls to fit their needs could prove very useful.

Gorlla said that enterprises “need to be able to trust that their models are aligned with their policies,” which is why methods like the one he helped develop will be critical for businesses.

“CTGT enables companies to deploy AI that adapts to their use cases without having to spend millions of dollars fine-tuning models for each one,” Gorlla said. “This is especially important in high-risk applications such as security, finance and healthcare, where the potential harms of an AI malfunction are severe.”

