
An early warning system for novel AI risks

Responsibility and safety

Published
25 May 2023
Authors

Toby Shevlane

New research proposes a framework for evaluating general-purpose models against novel threats

To pioneer responsibly at the cutting edge of artificial intelligence (AI) research, we must identify new capabilities and novel risks in our AI systems as early as possible.

AI researchers already use a range of evaluation benchmarks to identify unwanted behaviours in AI systems, such as AI systems making misleading statements, biased decisions, or repeating copyrighted content. Now, as the AI community builds and deploys increasingly powerful AI, we must expand the evaluation portfolio to include the possibility of extreme risks from general-purpose AI models that have strong skills in manipulation, deception, cyber-offense, or other dangerous capabilities.

In our latest paper, we introduce a framework for evaluating these novel threats, co-authored with colleagues from the University of Cambridge, University of Oxford, University of Toronto, Université de Montréal, OpenAI, Anthropic, Alignment Research Center, Centre for Long-Term Resilience, and Centre for the Governance of AI.

Model safety evaluations, including those assessing extreme risks, will be a critical component of safe AI development and deployment.

Overview of our proposed approach: to assess extreme risks from new, general-purpose AI systems, developers must evaluate for dangerous capabilities and alignment (see below). By identifying risks early, this unlocks opportunities to be more responsible when training new AI systems, deploying those systems, transparently describing their risks, and applying appropriate cybersecurity standards.

Evaluating for extreme risks

General-purpose models typically learn their capabilities and behaviours during training. However, existing methods for steering the learning process are imperfect. For example, previous research at Google DeepMind has explored how AI systems can learn to pursue undesired goals even when we correctly reward them for good behaviour.

Responsible AI developers must look ahead and anticipate possible future developments and novel risks. After continued progress, future general-purpose models may learn a variety of dangerous capabilities by default. For instance, it is plausible (though uncertain) that future AI systems will be able to conduct offensive cyber operations, skilfully deceive humans in dialogue, manipulate humans into carrying out harmful actions, design or acquire weapons (e.g. biological or chemical), or fine-tune and operate other high-risk AI systems on cloud computing platforms.

People with malicious intentions who gain access to such models could misuse their capabilities. Or, due to failures of alignment, AI models might take harmful actions even without anybody intending this.

Model evaluation helps us identify these risks early. Under our framework, AI developers would use model evaluation to uncover:

  1. To what extent a model has certain “dangerous capabilities” that could be used to threaten security, exert influence, or evade oversight.
  2. To what extent the model is prone to applying its capabilities to cause harm (i.e. the model’s alignment). Alignment evaluations should confirm that the model behaves as intended even across a very wide range of scenarios and, where possible, should examine the model’s internal workings.

The results of these evaluations will help AI developers understand whether the ingredients sufficient for extreme risk are present. The highest-risk cases will involve multiple dangerous capabilities combined together. The AI system doesn’t need to provide all the ingredients itself, as shown in this diagram:

Ingredients for extreme risk: sometimes specific capabilities could be outsourced, either to humans (e.g. users or crowdworkers) or to other AI systems. These capabilities must be applied for harm, either due to misuse or failures of alignment (or a mixture of both).

A rule of thumb: the AI community should treat an AI system as highly dangerous if it has a capability profile sufficient to cause extreme harm, assuming it is misused or misaligned. To deploy such a system in the real world, an AI developer would need to demonstrate an unusually high standard of safety.
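To make this rule of thumb concrete, here is a minimal illustrative sketch; the capability categories, scores, and thresholds are hypothetical and not part of the paper's methodology. It shows how results from dangerous-capability and alignment evaluations might be combined into a single risk judgement:

```python
from dataclasses import dataclass

# Illustrative data structures only; categories, scores, and the 0.7 cut-off
# are hypothetical, not the evaluation suite described in the paper.

@dataclass
class CapabilityResult:
    category: str      # e.g. "cyber-offense", "deception", "manipulation"
    score: float       # 0.0 (capability absent) .. 1.0 (strong capability)

@dataclass
class AlignmentResult:
    scenario: str
    behaved_as_intended: bool

DANGER_THRESHOLD = 0.7  # hypothetical cut-off for a "dangerous capability"

def has_dangerous_capability_profile(capabilities: list[CapabilityResult]) -> bool:
    """True if the capability profile could enable extreme harm, assuming
    the model were misused or misaligned (the rule of thumb above)."""
    return any(c.score >= DANGER_THRESHOLD for c in capabilities)

def assess_extreme_risk(capabilities: list[CapabilityResult],
                        alignment: list[AlignmentResult]) -> str:
    misaligned = any(not a.behaved_as_intended for a in alignment)
    if not has_dangerous_capability_profile(capabilities):
        return "standard review"
    # The dangerous capability profile alone triggers "highly dangerous"
    # treatment; alignment failures make deployment even harder to justify.
    if misaligned:
        return "highly dangerous: do not deploy without mitigation"
    return "highly dangerous: deployment requires an unusually high standard of safety"

# Example usage
print(assess_extreme_risk(
    capabilities=[CapabilityResult("deception", 0.8),
                  CapabilityResult("cyber-offense", 0.3)],
    alignment=[AlignmentResult("tool use under oversight", True)],
))
```

In keeping with the rule of thumb, the sketch lets a dangerous capability profile alone trigger the "highly dangerous" treatment; strong alignment results change what the safety case must show, not whether one is needed.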

Model evaluation as critical governance infrastructure

If we have better tools for identifying which models are risky, companies and regulators can better ensure:

  1. Responsible training: Responsible decisions are made about whether and how to train a new model that shows early signs of risk.
  2. Responsible deployment: Responsible decisions are made about whether, when, and how to deploy potentially risky models.
  3. Transparency: Useful and actionable information is reported to stakeholders, to help them prepare for or mitigate potential risks.
  4. Appropriate security: Strong information security controls and systems are applied to models that might pose extreme risks.

We have developed a blueprint for how model evaluations for extreme risks should feed into important decisions around training and deploying a highly capable, general-purpose model. The developer conducts evaluations throughout, and grants structured model access to external safety researchers and model auditors so they can conduct additional evaluations. The evaluation results can then inform risk assessments before model training and deployment.

A blueprint for embedding model evaluations for extreme risks into important decision-making processes throughout model training.
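As a rough illustration of how such a blueprint might be wired into a development process, the sketch below uses hypothetical gate names and logic, not DeepMind's actual process. It combines internal evaluation results with an external audit at each checkpoint and gates further training or deployment on the outcome:

```python
from enum import Enum

class Gate(Enum):
    CONTINUE_TRAINING = "continue training"
    PAUSE_FOR_REVIEW = "pause: early risk signs, escalate before further training"
    OK_TO_DEPLOY = "deploy with appropriate security and transparency reporting"
    DO_NOT_DEPLOY = "do not deploy: risk assessment not satisfied"

def governance_checkpoint(internal_evals: dict, external_audit: dict,
                          deploying: bool) -> Gate:
    """Combine the developer's own evaluation results with those from external
    safety researchers and model auditors into a training/deployment decision.
    (Hypothetical gate logic for illustration only.)"""
    risk_flagged = internal_evals.get("extreme_risk") or external_audit.get("extreme_risk")
    if not deploying:
        return Gate.PAUSE_FOR_REVIEW if risk_flagged else Gate.CONTINUE_TRAINING
    return Gate.DO_NOT_DEPLOY if risk_flagged else Gate.OK_TO_DEPLOY

# Example: a mid-training checkpoint where only the external auditor flags a concern.
print(governance_checkpoint(
    internal_evals={"extreme_risk": False},
    external_audit={"extreme_risk": True},
    deploying=False,
))  # -> Gate.PAUSE_FOR_REVIEW
```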

Looking ahead

Important early work on model evaluations for extreme risks is already underway at Google DeepMind and elsewhere. But more progress, both technical and institutional, is needed to build an evaluation process that catches all possible risks and helps safeguard against future, emerging challenges.

Model evaluation is not a panacea. Some risks could slip through the net, for example because they depend too heavily on factors external to the model, such as complex social, political, and economic forces in society. Model evaluation must be combined with other risk assessment tools and a wider dedication to safety across industry, government, and civil society.

Google’s recent blog on responsible AI states that “individual practices, shared industry standards, and sound government policies would be essential to getting AI right.” We hope that many others working in AI, and in the sectors affected by this technology, will come together to create approaches and standards for developing and deploying AI safely for everyone.

We believe that having processes for tracking the emergence of risky properties in models, and for adequately responding to concerning results, is a critical part of being a responsible developer operating at the frontier of AI capabilities.
